Recology


USDA plants database API in R

The USDA maintains a database of plant information, some of it trait data, some of it life history. Check it out at http://plants.usda.gov/java/

They’ve been talking about releasing an API for a long time, but have not done so.

Thus, since at least some version of their data is available on the public web, I’ve created a RESTful API for the data at https://plantsdb.xyz.

Check out the API, and open issues for bugs/feature requests in the github repo.

The following is an example using it from R, but you can use it from anywhere, the command line, Ruby, Python, a browser, whatevs.
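For example, the same search route used below works from the command line:

curl 'https://plantsdb.xyz/search?limit=1' | jq .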

Here, we’ll use request, a higher level http client for R that I’ve been working on. A small quirk with request is that when piping, you have to assign the output of the request to an object before you can do any further manipulation. But that’s probably good for avoiding too long pipe chains.
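That is, something like this (a sketch, using the root object defined below):

res <- root %>% api_path(search)  # assign the result first
res$count                         # then manipulate it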

Note that I’ve set tibble.max_extra_cols = 15 so as not to print the many columns that are returned, for blog post brevity. When you run the examples below you’ll get more columns.
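That option is set like so:

options(tibble.max_extra_cols = 15)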

Install from CRAN

install.packages("request")

There is a small improvement in the dev version of request that makes any data.frame’s into tibble’s (which the examples below use). To get that, install from GitHub:

devtools::install_github("sckott/request")
library('request')
library('tibble')

Heartbeat

The simplest call to the API is to the /heartbeat route, which just lists the available routes.

Set the base url we’ll use throughout the work below

root <- api("https://plantsdb.xyz")
root %>% api_path(heartbeat)
#> $routes
#> [1] "/search (HEAD, GET)" "/heartbeat"

Okay, so there are just two routes, /search and /heartbeat.

The search route supports the following parameters (a combined example follows the list):

  • fields, e.g., fields='Genus,Species' (default: all fields returned)
  • limit, e.g., limit=10 (default: 10)
  • offset, e.g., offset=1 (default: 0)
  • search on any fields in the output, e.g., Genus=Pinus or Species=annua
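Combining several of these in one call might look like this (a sketch; each parameter is also demonstrated on its own below):

root %>%
  api_path(search) %>%
  api_query(Genus = Pinus, fields = "Genus,Species", limit = 3, offset = 5)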

Let’s first not pass any params

root %>% api_path(search)
#> $count
#> [1] 92171
#> 
#> $returned
#> [1] 10
#> 
#> $citation
#> [1] "USDA, NRCS. 2016. The PLANTS Database (http://plants.usda.gov, 12 July 2016). National Plant Data Team, Greensboro, NC 27401-4901 USA."
#> 
#> $terms
#> [1] "Our plant information, including the distribution maps, lists, and text, is not copyrighted and is free for any use."
#> 
#> $data
#> # A tibble: 10 × 134
#>       id Symbol Accepted_Symbol_x Synonym_Symbol_x
#> *  <int>  <chr>             <chr>            <chr>
#> 1      1   ABAB              ABAB                 
#> 2      2  ABAB2             ABPR3            ABAB2
#> 3      3  ABAB3              ABTH            ABAB3
#> 4      4 ABAB70            ABAB70                 
#> 5      5   ABAC             ABUMB             ABAC
#> 6      6   ABAL              ABAL                 
#> 7      7  ABAL2             ABUMU            ABAL2
#> 8      8  ABAL3             ABAL3                 
#> 9      9   ABAM              ABAM                 
#> 10    10  ABAM2             ABAM2                 
#> # ... with 130 more variables: Scientific_Name_x <chr>,
#> #   Hybrid_Genus_Indicator <chr>, Hybrid_Species_Indicator <chr>,
#> #   Species <chr>, Subspecies_Prefix <chr>,
#> #   Hybrid_Subspecies_Indicator <chr>, Subspecies <chr>,
#> #   Variety_Prefix <chr>, Hybrid_Variety_Indicator <chr>, Variety <chr>,
#> #   Subvariety_Prefix <chr>, Subvariety <chr>, Forma_Prefix <chr>,
#> #   Forma <chr>, Genera_Binomial_Author <chr>, ...
#> 
#> $error
#> NULL

You get slots:

  • count: number of results found
  • returned: number of results returned
  • citation: suggested citation, from USDA
  • terms: terms of use, from USDA
  • data: the results
  • error: if an error occurred, you’ll see the message here

Note that if any data.frame’s are found, we make them into tibble’s, nicely formatted data.frame’s that make it easy to deal with large data.
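Since data comes back as a tibble, you can manipulate it directly (a sketch, using dplyr):

res <- root %>% api_path(search)
res$data %>% dplyr::select(Symbol, Genus, Species)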

Pagination

Limit the number of results

root %>%
  api_path(search) %>%
  api_query(limit = 5)
#> $count
#> [1] 92171
#> 
#> $returned
#> [1] 5
#> 
#> $citation
#> [1] "USDA, NRCS. 2016. The PLANTS Database (http://plants.usda.gov, 12 July 2016). National Plant Data Team, Greensboro, NC 27401-4901 USA."
#> 
#> $terms
#> [1] "Our plant information, including the distribution maps, lists, and text, is not copyrighted and is free for any use."
#> 
#> $data
#> # A tibble: 5 × 134
#>      id Symbol Accepted_Symbol_x Synonym_Symbol_x
#> * <int>  <chr>             <chr>            <chr>
#> 1     1   ABAB              ABAB                 
#> 2     2  ABAB2             ABPR3            ABAB2
#> 3     3  ABAB3              ABTH            ABAB3
#> 4     4 ABAB70            ABAB70                 
#> 5     5   ABAC             ABUMB             ABAC
#> # ... with 130 more variables: Scientific_Name_x <chr>,
#> #   Hybrid_Genus_Indicator <chr>, Hybrid_Species_Indicator <chr>,
#> #   Species <chr>, Subspecies_Prefix <chr>,
#> #   Hybrid_Subspecies_Indicator <chr>, Subspecies <chr>,
#> #   Variety_Prefix <chr>, Hybrid_Variety_Indicator <chr>, Variety <chr>,
#> #   Subvariety_Prefix <chr>, Subvariety <chr>, Forma_Prefix <chr>,
#> #   Forma <chr>, Genera_Binomial_Author <chr>, ...
#> 
#> $error
#> NULL

Change the record to start at

root %>%
  api_path(search) %>%
  api_query(limit = 5, offset = 10)
#> $count
#> [1] 92161
#> 
#> $returned
#> [1] 5
#> 
#> $citation
#> [1] "USDA, NRCS. 2016. The PLANTS Database (http://plants.usda.gov, 12 July 2016). National Plant Data Team, Greensboro, NC 27401-4901 USA."
#> 
#> $terms
#> [1] "Our plant information, including the distribution maps, lists, and text, is not copyrighted and is free for any use."
#> 
#> $data
#> # A tibble: 5 × 134
#>      id Symbol Accepted_Symbol_x Synonym_Symbol_x
#> * <int>  <chr>             <chr>            <chr>
#> 1    11  ABAM3             ABAM3                 
#> 2    12  ABAM4              NAAM            ABAM4
#> 3    13  ABAM5              ABAB            ABAM5
#> 4    14   ABAN              ABAN                 
#> 5    15  ABANA              ABAN            ABANA
#> # ... with 130 more variables: Scientific_Name_x <chr>,
#> #   Hybrid_Genus_Indicator <chr>, Hybrid_Species_Indicator <chr>,
#> #   Species <chr>, Subspecies_Prefix <chr>,
#> #   Hybrid_Subspecies_Indicator <chr>, Subspecies <chr>,
#> #   Variety_Prefix <chr>, Hybrid_Variety_Indicator <chr>, Variety <chr>,
#> #   Subvariety_Prefix <chr>, Subvariety <chr>, Forma_Prefix <chr>,
#> #   Forma <chr>, Genera_Binomial_Author <chr>, ...
#> 
#> $error
#> NULL
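To page through more than the default 10 records, you could bump offset in a loop and combine the pages (a sketch, gathering 20 records in pages of 5):

out <- list()
for (i in seq(0, 15, by = 5)) {
  res <- root %>%
    api_path(search) %>%
    api_query(limit = 5, offset = i)
  out[[length(out) + 1]] <- res$data
}
alldat <- dplyr::bind_rows(out)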

Return fields

You can say which fields you want returned, useful when you just want a subset of fields.

root %>%
  api_path(search) %>%
  api_query(fields = 'Genus,Species,Symbol')
#> $count
#> [1] 92171
#> 
#> $returned
#> [1] 10
#> 
#> $citation
#> [1] "USDA, NRCS. 2016. The PLANTS Database (http://plants.usda.gov, 12 July 2016). National Plant Data Team, Greensboro, NC 27401-4901 USA."
#> 
#> $terms
#> [1] "Our plant information, including the distribution maps, lists, and text, is not copyrighted and is free for any use."
#> 
#> $data
#> # A tibble: 10 × 3
#>    Symbol     Species       Genus
#> *   <chr>       <chr>       <chr>
#> 1    ABAB abutiloides    Abutilon
#> 2   ABAB2       abrus       Abrus
#> 3   ABAB3    abutilon    Abutilon
#> 4  ABAB70    abietina Abietinella
#> 5    ABAC   acutalata     Abronia
#> 6    ABAL      alpina     Abronia
#> 7   ABAL2        alba     Abronia
#> 8   ABAL3        alba       Abies
#> 9    ABAM    amabilis       Abies
#> 10  ABAM2     ameliae     Abronia
#> 
#> $error
#> NULL

Query

You can query on individual fields. Note that api_query() uses non-standard evaluation, so values like Pinus go unquoted.

root %>%
  api_path(search) %>%
  api_query(Genus = Pinus, fields = "Genus,Species")
#> $count
#> [1] 185
#> 
#> $returned
#> [1] 10
#> 
#> $citation
#> [1] "USDA, NRCS. 2016. The PLANTS Database (http://plants.usda.gov, 12 July 2016). National Plant Data Team, Greensboro, NC 27401-4901 USA."
#> 
#> $terms
#> [1] "Our plant information, including the distribution maps, lists, and text, is not copyrighted and is free for any use."
#> 
#> $data
#> # A tibble: 10 × 2
#>       Species Genus
#> *       <chr> <chr>
#> 1  albicaulis Pinus
#> 2    apacheca Pinus
#> 3    aristata Pinus
#> 4   arizonica Pinus
#> 5    armandii Pinus
#> 6   arizonica Pinus
#> 7    aristata Pinus
#> 8   arizonica Pinus
#> 9   arizonica Pinus
#> 10  attenuata Pinus
#> 
#> $error
#> NULL

Another query example

root %>%
  api_path(search) %>%
  api_query(Species = annua, fields = "Genus,Species")
#> $count
#> [1] 30
#> 
#> $returned
#> [1] 10
#> 
#> $citation
#> [1] "USDA, NRCS. 2016. The PLANTS Database (http://plants.usda.gov, 12 July 2016). National Plant Data Team, Greensboro, NC 27401-4901 USA."
#> 
#> $terms
#> [1] "Our plant information, including the distribution maps, lists, and text, is not copyrighted and is free for any use."
#> 
#> $data
#> # A tibble: 10 × 2
#>    Species         Genus
#> *    <chr>         <chr>
#> 1    annua        Adonis
#> 2    annua     Artemisia
#> 3    annua   Bulbostylis
#> 4    annua    Castilleja
#> 5    annua   Craniolaria
#> 6    annua Dimorphotheca
#> 7    annua       Drosera
#> 8    annua    Eleocharis
#> 9    annua  Fimbristylis
#> 10   annua    Heliomeris
#> 
#> $error
#> NULL

And one more example. Here we’re interested in finding taxa that are perennials

root %>%
  api_path(search) %>%
  api_query(Duration = Perennial, fields = "Genus,Species,Symbol,Duration")
#> $count
#> [1] 25296
#> 
#> $returned
#> [1] 10
#> 
#> $citation
#> [1] "USDA, NRCS. 2016. The PLANTS Database (http://plants.usda.gov, 12 July 2016). National Plant Data Team, Greensboro, NC 27401-4901 USA."
#> 
#> $terms
#> [1] "Our plant information, including the distribution maps, lists, and text, is not copyrighted and is free for any use."
#> 
#> $data
#> # A tibble: 10 × 4
#>    Symbol     Species  Duration    Genus
#> *   <chr>       <chr>     <chr>    <chr>
#> 1    ABAB abutiloides Perennial Abutilon
#> 2    ABAL      alpina Perennial  Abronia
#> 3   ABAL3        alba Perennial    Abies
#> 4    ABAM    amabilis Perennial    Abies
#> 5   ABAM2     ameliae Perennial  Abronia
#> 6   ABAM3   ammophila Perennial  Abronia
#> 7    ABAR   argillosa Perennial  Abronia
#> 8    ABAU     auritum Perennial Abutilon
#> 9    ABBA    balsamea Perennial    Abies
#> 10  ABBAB    balsamea Perennial    Abies
#> 
#> $error
#> NULL

gbids - GenBank IDs API is back up!


Back in March this year I wrote a post about a new API for working with GenBank IDs.

I had to take the API down because it was too expensive to keep up. Expensive because the dump of the data is very large (3.8 GB compressed), and I need disk space on the server to uncompress that to, I think, about 18 GB, then load it into MySQL, which takes maybe another 30 GB or so. Anyway, it’s not expensive because of high traffic - although I wish that were the case - but because of needing lots of disk space.

I was fortunate to recently receive some Amazon Cloud Credits for Research. The credits expire in 1 year. With these credits, I’ve put the GBIDS API back up. In the next year I’m hoping to gain user traction suggesting that it’s useful to enough people to keep maintaining - in which case I’ll seek ways to fund it.

But that means I need people to use it! So please do give it a try. Let me know what could be better; what could be faster; what API routes/features/etc. you’d like to see.

Plans

Plans for the future of the GBIDS API:

  • Auto-update the Genbank data. This is quite complicated since the dump is so large. I can either keep an EC2-attached disk large enough to do the dump download/expansion/load/etc., or spin up a new instance each Sunday when they do their data release, do the SQL load, make a dump, then shuttle the SQL dump to the running instance and load in the new data from the dump. I haven’t got this bit running yet, so data is from Aug 7, 2016.
  • Add taxonomic IDs. Genbank also dumps their taxonomic IDs. I think it should be possible to get taxonomic IDs into the API, so that users can map accession numbers to taxon IDs and vice versa.
  • Performance: as anyone would want, I want to continually improve performance. I’ll watch out for things I can do, but also let me know what seems too slow.

Try it

Get 5 accession numbers

curl 'https://gbids.xyz/acc?limit=5' | jq .
#> {
#>   "matched": 692006925,
#>   "returned": 5,
#>   "data": [
#>     "A00002",
#>     "A00003",
#>     "X17276",
#>     "X60065",
#>     "CAA42669"
#>   ],
#>   "error": null
#> }
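The same call works from R too, e.g. with the request package (a sketch):

library('request')
gb <- api('https://gbids.xyz')
res <- gb %>% api_path(acc) %>% api_query(limit = 5)
res$data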

Request to match accession identifiers, some of which exist, while some do not

curl 'https://gbids.xyz/acc/AACY024124486,AACY024124483,asdfd,asdf,AACY024124476' | jq .
#> {
#>   "matched": 3,
#>   "returned": 5,
#>   "data": {
#>     "AACY024124486": true,
#>     "AACY024124483": true,
#>     "asdfd": false,
#>     "asdf": false,
#>     "AACY024124476": true
#>   },
#>   "error": null
#> }
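Or from R, for example with jsonlite:

ids <- 'AACY024124486,AACY024124483,asdfd,asdf,AACY024124476'
jsonlite::fromJSON(paste0('https://gbids.xyz/acc/', ids))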

There are many more examples in the API docs

nonoyes - text analysis of Reply All podcast transcripts

Reply All is a great podcast. I’ve been wanting to learn some text analysis tools, and transcripts from the podcast are on their site.

I took some approaches outlined in this vignette from the tidytext package, and used the tokenizers package and some of the tidyverse.

Code on github at sckott/nonoyes

Also check out the html version

Setup

Load deps

library("httr")
library("xml2")
library("stringi")
library("dplyr")
library("ggplot2")
library("tokenizers")
library("tidytext")
library("tidyr")

Source helper functions (funs.R, in the sckott/nonoyes repo, defines the helpers used below: get_urls(), transcript_fetch(), transcript_parse(), count_word(), strextract(), try_tokenize(), and sent_cont_plot())

source("funs.R")

Set the base url

ra_base <- "https://gimletmedia.com/show/reply-all/episodes"

URLs

Make all urls for each page of episodes

urls <- c(ra_base, file.path(ra_base, "page", 2:8))

Get urls for each episode

res <- lapply(urls, get_urls)

Remove those that are rebroadcasts, updates, or revisited

res <- grep("rebroadcast|update|revisited", unlist(res), value = TRUE, invert = TRUE)

Episode names

Give numbers to some episodes that don’t have them

epnames <- sub("/$", "", sub("https://gimletmedia.com/episode/", "", res))
epnames <- sub("the-anxiety-box", "8-the-anxiety-box", epnames)
epnames <- sub("french-connection", "10-french-connection", epnames)
epnames <- sub("ive-killed-people-and-i-have-hostages", "15-ive-killed-people-and-i-have-hostages", epnames)
epnames <- sub("6-this-proves-everything", "75-this-proves-everything", epnames)
epnames <- sub("zardulu", "56-zardulu", epnames)

Transcripts

Get transcripts

txts <- lapply(res, transcript_fetch, sleep = 1)

Parse transcripts

txtsp <- lapply(txts, transcript_parse)

Summary word usage

Summarise data for each transcript

dat <- stats::setNames(lapply(txtsp, function(m) {
  bind_rows(lapply(m, function(v) {
    tmp <- unname(vapply(v, nchar, 1))
    data_frame(
      n = length(tmp),
      mean = mean(tmp),
      n_laugh = count_word(v, "laugh"),
      n_groan = count_word(v, "groan")
    )
  }), .id = "name")
}), epnames)

Bind data together into a single data.frame, then filter and summarise

data <- bind_rows(dat, .id = "episode") %>%
  filter(!is.na(episode)) %>%
  filter(grepl("^PJ$|^ALEX GOLDMAN$", name)) %>%
  mutate(ep_no = as.numeric(strextract(episode, "^[0-9]+"))) %>%
  group_by(ep_no) %>%
  mutate(nrow = NROW(ep_no)) %>%
  ungroup() %>%
  filter(nrow == 2)
data
#> # A tibble: 114 × 8
#>                 episode         name     n      mean n_laugh n_groan ep_no
#>                   <chr>        <chr> <int>     <dbl>   <int>   <int> <dbl>
#> 1            73-sandbox           PJ    89 130.65169       9       0    73
#> 2            73-sandbox ALEX GOLDMAN    25  44.00000       1       1    73
#> 3       72-dead-is-paul           PJ   137  67.77372      17       0    72
#> 4       72-dead-is-paul ALEX GOLDMAN    90  61.82222       8       0    72
#> 5  71-the-picture-taker           PJ    74  77.70270       3       0    71
#> 6  71-the-picture-taker ALEX GOLDMAN    93 105.94624       6       0    71
#> 7        69-disappeared           PJ    72  76.50000       2       0    69
#> 8        69-disappeared ALEX GOLDMAN    50 135.90000       5       0    69
#> 9      68-vampire-rules           PJ   142  88.00704       6       0    68
#> 10     68-vampire-rules ALEX GOLDMAN   117  73.16239      13       0    68
#> # ... with 104 more rows, and 1 more variables: nrow <int>

Number of words - seems PJ talks more, but I didn’t do a quantitative comparison

ggplot(data, aes(ep_no, n, colour = name)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_line(aes(group = ep_no), colour = "black") +
  scale_color_discrete(labels = c('Alex', 'PJ'))

Laughs per episode - take home: PJ laughs a lot

ggplot(data, aes(ep_no, n_laugh, colour = name)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_line(aes(group = ep_no), colour = "black") +
  scale_color_discrete(labels = c('Alex', 'PJ'))

Sentiment

Drop episodes with no parsed transcript

zero <- which(vapply(txtsp, length, 1) == 0)
txtsp_ <- Filter(function(x) length(x) != 0, txtsp)

Tokenize words, and create a data_frame

wordz <- stats::setNames(
  lapply(txtsp_, function(z) {
    bind_rows(
      if (is.null(try_tokenize(z$`ALEX GOLDMAN`))) {
        data_frame()
      } else {
        data_frame(
          name = "Alex",
          word = try_tokenize(z$`ALEX GOLDMAN`)
        )
      },
      if (is.null(try_tokenize(z$PJ))) {
        data_frame()
      } else {
        data_frame(
          name = "PJ",
          word = try_tokenize(z$PJ)
        )
      }
    )
  }), epnames[-zero])

Combine into a single data_frame

(wordz_df <- bind_rows(wordz, .id = "episode"))
#> # A tibble: 104,713 × 3
#>       episode  name      word
#>         <chr> <chr>     <chr>
#> 1  73-sandbox  Alex      alex
#> 2  73-sandbox  Alex   goldman
#> 3  73-sandbox  Alex         i
#> 4  73-sandbox  Alex generally
#> 5  73-sandbox  Alex     don’t
#> 6  73-sandbox  Alex      alex
#> 7  73-sandbox  Alex    really
#> 8  73-sandbox  Alex      alex
#> 9  73-sandbox  Alex    groans
#> 10 73-sandbox  Alex        so
#> # ... with 104,703 more rows

Calculate sentiment using tidytext

bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)
sent <- wordz_df %>%
  inner_join(bing) %>%
  count(name, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ungroup() %>%
  filter(!is.na(episode)) %>%
  complete(episode, name) %>%
  mutate(ep_no = as.numeric(strextract(episode, "^[0-9]+")))
sent
#> # A tibble: 148 × 6
#>                                        episode  name negative positive
#>                                          <chr> <chr>    <dbl>    <dbl>
#> 1  1-an-app-sends-a-stranger-to-say-i-love-you  Alex       19       30
#> 2  1-an-app-sends-a-stranger-to-say-i-love-you    PJ       14       14
#> 3                         10-french-connection  Alex       15       32
#> 4                         10-french-connection    PJ       16       36
#> 5     11-did-errol-morris-brother-invent-email  Alex       NA       NA
#> 6     11-did-errol-morris-brother-invent-email    PJ       25       30
#> 7                           12-backend-trouble  Alex       20       15
#> 8                           12-backend-trouble    PJ       40       59
#> 9                              13-love-is-lies  Alex       NA       NA
#> 10                             13-love-is-lies    PJ       45       64
#> # ... with 138 more rows, and 2 more variables: sentiment <dbl>,
#> #   ep_no <dbl>

Plot names separately

ggplot(sent, aes(ep_no, sentiment, fill = name)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 2, scales = "free_x")

Compare for each episode

ggplot(sent, aes(ep_no, sentiment, fill = name)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.5), width = 0.6)

Most common positive and negative words

Clearly, the word like is rarely used as a positive word meaning, e.g., that they like something, but rather in the colloquial like, totally sense. So it’s removed.
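One way to drop it before plotting (a sketch; the sent_cont_plot() helper in funs.R presumably does something like this):

wordz_df <- wordz_df %>% filter(word != 'like')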

Alex

sent_cont_plot(wordz_df, "Alex")

PJ

sent_cont_plot(wordz_df, "PJ")