Recology

R/etc.

 

httping - ping and time http requests

I've been working on a little thing called httping - a small R package that started as a pkg to Ping urls and time requests. It's a port of the Ruby gem httping. The httr package is in Depends in this package, so its functions can be called directly, without having to load httr explicitly yourself.

In addition to timing requests, I've been tinkering with how to make http requests, with curl options accepting and returning the same object so they can be chained together, and then that object passed to a http verb like GET. Maybe this is a bad idea, but maybe not.

Installation

Install:

One non-CRAN dep (httpcode) needs to be installed first.

install.packages("devtools")
devtools::install_github("sckott/httpcode")
devtools::install_github("sckott/httping")

Then load package

library("httping")

Time requests

The idea with time() is to provide easy to use and understand information on how long http requests take to run. You should be able to pass in any httr verbs (GET(), POST(), etc.) to time(). time() repeats whatever http request you pass to it by default 10 times, but you can set the number of times to repeat in the count parameter. In addition, the flood parameter controls whether there is a delay between requests or not, and delay controls length of the delay.

A GET request

GET("http://google.com") %>% time(count=3)
#> 29.392 kb - http://www.google.com/ code:200 time(ms):92.444
#> 29.392 kb - http://www.google.com/ code:200 time(ms):82.127
#> 29.392 kb - http://www.google.com/ code:200 time(ms):85.587
#> <http time>
#>   Avg. min (ms):  82.127
#>   Avg. max (ms):  92.444
#>   Avg. mean (ms): 86.71933

A POST request

POST("http://httpbin.org/post", body = "A simple text string") %>% time(count=3)
#> 10.144 kb - http://httpbin.org/post code:200 time(ms):267.574
#> 10.144 kb - http://httpbin.org/post code:200 time(ms):113.309
#> 10.144 kb - http://httpbin.org/post code:200 time(ms):99.938
#> <http time>
#>   Avg. min (ms):  99.938
#>   Avg. max (ms):  267.574
#>   Avg. mean (ms): 160.2737

The return object is a list with slots for all the httr response objects, the times for each request, and the average times. The number of requests, and the delay between requests are included as attributes.

res <- GET("http://google.com") %>% time(count=3)
#> 29.392 kb - http://www.google.com/ code:200 time(ms):82.086
#> 29.392 kb - http://www.google.com/ code:200 time(ms):78.15
#> 29.392 kb - http://www.google.com/ code:200 time(ms):79.763
attributes(res)
#> $names
#> [1] "times"    "averages" "request" 
#> 
#> $count
#> [1] 3
#> 
#> $delay
#> [1] 0.5
#> 
#> $class
#> [1] "http_time"

Or print a summary of a response, gives more detail

res %>% summary()
#> <http time, averages (min max mean)>
#>   Total (s):           78.15 82.086 79.99967
#>   Tedirect (s):        26.695 34.319 29.80633
#>   Namelookup time (s): 0.025 0.03 0.028
#>   Connect (s):         0.028 0.034 0.032
#>   Pretransfer (s):     0.069 0.081 0.07633333
#>   Starttransfer (s):   45.44 49.326 47.95867

Messages are printed using cat, so you can suppress those using verbose=FALSE, like

GET("http://google.com") %>% time(count=3, verbose = FALSE)
#> <http time>
#>   Avg. min (ms):  86.12
#>   Avg. max (ms):  94.035
#>   Avg. mean (ms): 89.12467

Ping an endpoint

The idea with ping() is to simply return the http status code along with a message describing what that code means. That's it.

This function is a bit different, accepts a url as first parameter, but can accept any args passed on to httr verb functions, like GET, POST, etc.

"http://google.com" %>% ping()
#> <http ping> 200
#>   Message: OK
#>   Description: Request fulfilled, document follows

Or pass in additional arguments to modify request

"http://google.com" %>% ping(config=verbose())
#> -> GET / HTTP/1.1
#> -> User-Agent: curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.1
#> -> Host: google.com
#> -> Accept-Encoding: gzip
...cutoff

Even simpler verbs

httr is already easy, but Get():

  • Allows use of an intuitive chaining workflow
  • Parses data for you using httr built in format guesser, which should work in most cases

A simple GET request

"http://httpbin.org/get" %>%
  Get()
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.1"
#> 
#> 
#> $origin
#> [1] "24.21.209.71"
#> 
#> $url
#> [1] "http://httpbin.org/get"

You can buid up options by calling functions

"http://httpbin.org/get" %>%
  Progress() %>%
  Verbose()
#> <http request> 
#>   url: http://httpbin.org/get
#>   config: 
#> Config: 
#> List of 4
#>  $ noprogress      :FALSE
#>  $ progressfunction:function (...)  
#>  $ debugfunction   :function (...)  
#>  $ verbose         :TRUE

Then eventually execute the GET request

"http://httpbin.org/get" %>%
  Verbose() %>%
  Progress() %>%
  Get()
#> -> GET /get HTTP/1.1
#> -> User-Agent: curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.1
#> -> Host: httpbin.org
#> -> Accept-Encoding: gzip
#> -> Accept: application/json, text/xml, application/xml, */*
#> -> 
#> <- HTTP/1.1 200 OK
#> <- Server: nginx
#> <- Date: Fri, 30 Jan 2015 17:38:58 GMT
#> <- Content-Type: application/json
#> <- Content-Length: 288
#> <- Connection: keep-alive
#> <- Access-Control-Allow-Origin: *
#> <- Access-Control-Allow-Credentials: true
#> <- 
#>   |=======================================| 100%
#> 
#> $args
#> named list()
#> 
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> 
#> $headers$`Accept-Encoding`
#> [1] "gzip"
#> 
#> $headers$Host
#> [1] "httpbin.org"
#> 
#> $headers$`User-Agent`
#> [1] "curl/7.37.1 Rcurl/1.95.4.5 httr/0.6.1"
#> 
#> 
#> $origin
#> [1] "24.21.209.71"
#> 
#> $url
#> [1] "http://httpbin.org/get"
#> 

elastic - Elasticsearch from R

We've (ropensci) been working on an R client for interacting with Elasticsearch for a while now, first commit was November 2013.

Elasticsearch is a document database built on the JVM. elastic interacts with the Elasticsearch HTTP API, and includes functions for setting connection details to Elasticsearch instances, loading bulk data, searching for documents with both HTTP query variables and JSON based body requests. In addition, elastic provides functions for interacting with APIs for indices, documents, nodes, clusters, an interface to the cat API, and more.

Here's a few examples of what you can do.

Note: elastic was just pushed to CRAN. It just got accepted, so binaries may not be available, try again soon, or install from Github, or install from source from CRAN like install.packages("http://cran.r-project.org/src/contrib/elastic_0.3.0.tar.gz", repos=NULL, type="source").

Installation

install.packages("elastic")

Or install development version:

install.packages("devtools")
devtools::install_github("ropensci/elastic")

Then load package

library("elastic")

Install Elasticsearch

Unix (linux/osx)

Replace 1.4.1 with the version you are working with.

  • Download zip or tar file from Elasticsearch see here for download, e.g., curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.tar.gz
  • Uncompress it: tar -xvf elasticsearch-1.4.1.tar.gz
  • Move it: sudo mv /path/to/elasticsearch-1.4.1 /usr/local
  • Navigate to /usr/local: cd /usr/local
  • Add shortcut: sudo ln -s elasticsearch-1.4.1 elasticsearch

On OSX, you can install via Homebrew: brew install elasticsearch

Windows

Windows users can follow the above, but unzip the zip file instead of uncompressing the tar file.

Start Elasticsearch

  • Navigate to elasticsearch: cd /usr/local/elasticsearch
  • Start elasticsearch: bin/elasticsearch

I create a little bash shortcut called es that does both of the above commands in one step (cd /usr/local/elasticsearch && bin/elasticsearch).

Note: Windows users should run the elasticsearch.bat file

Initialize connection

The function connect() is used before doing anything else to set the connection details to your remote or local elasticsearch store. The details created by connect() are written to your options for the current session, and are used by elastic functions.

connect()

On package load, your base url and port are set to http://127.0.0.1 and 9200, respectively. You can of course override these settings per session or for all sessions.

Get data

Elasticsearch has a bulk load API to load data in fast. The format is pretty weird though. It's sort of JSON, but would pass no JSON linter. I include a few data sets in elastic so it's easy to get up and running, and so when you run examples in this package they'll actually run the same way (hopefully).

Shakespeare data

Elasticsearch provides some data on Shakespeare plays. I've provided a subset of this data in this package. Get the path for the file specific to your machine:

shakespeare <- system.file("examples", "shakespeare_data.json", package = "elastic")

Then load the data into Elasticsearch:

docs_bulk(shakespeare)

Public Library of Science (PLOS) data

A dataset inluded in the elastic package is metadata for PLOS scholarly articles.

plosdat <- system.file("examples", "plos_data.json", package = "elastic")
docs_bulk(plosdat)

Global Biodiversity Information Facility (GBIF) data

A dataset inluded in the elastic package is data for GBIF species occurrence records. Get the file path, then load:

gbifdat <- system.file("examples", "gbif_data.json", package = "elastic")
docs_bulk(gbifdat)

GBIF geo data with a coordinates element to allow geo_shape queries

gbifgeo <- system.file("examples", "gbif_geo.json", package = "elastic")
docs_bulk(gbifgeo)

The Search function

The main interface to searching documents in your Elasticsearch store is the function Search(). I nearly always develop R software using all lowercase, but R has a function called search(), and I wanted to avoid collision with that function.

Search() is an interface to both the HTTP search API (in which queries are passed in the URI of the request, meaning queries have to be relatively simple), as well as the POST API, or the Query DSL, in which queries are passed in the body of the request (so can be much more complex).

There are a huge amount of ways you can search Elasticsearch documents - this tutorial covers some of them, and highlights the ways in which you interact with the R outputs.

Search an index

out <- Search(index="shakespeare")
out$hits$total
#> [1] 5000
out$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Search an index by type

Search(index="shakespeare", type="act")$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "act"
#> 
#> $`_id`
#> [1] "2227"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 2228
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 81
#> 
#> $`_source`$line_number
#> [1] ""
#> 
#> $`_source`$speaker
#> [1] "FALSTAFF"
#> 
#> $`_source`$text_entry
#> [1] "ACT IV"

Return certain fields

Search(index="shakespeare", fields=c('play_name','speaker'))$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $fields
#> $fields$speaker
#> $fields$speaker[[1]]
#> [1] "KING HENRY IV"
#> 
#> 
#> $fields$play_name
#> $fields$play_name[[1]]
#> [1] "Henry IV"

Sorting

Search(index="shakespeare", type="act", sort="text_entry")$hits$hits[1:2]
#> [[1]]
#> [[1]]$`_index`
#> [1] "shakespeare"
#> 
#> [[1]]$`_type`
#> [1] "act"
#> 
#> [[1]]$`_id`
#> [1] "2227"
#> 
#> [[1]]$`_version`
#> [1] 1
#> 
#> [[1]]$`_score`
#> NULL
#> 
#> [[1]]$`_source`
#> [[1]]$`_source`$line_id
#> [1] 2228
#> 
#> [[1]]$`_source`$play_name
#> [1] "Henry IV"
#> 
#> [[1]]$`_source`$speech_number
#> [1] 81
#> 
#> [[1]]$`_source`$line_number
#> [1] ""
#> 
#> [[1]]$`_source`$speaker
#> [1] "FALSTAFF"
#> 
#> [[1]]$`_source`$text_entry
#> [1] "ACT IV"
#> 
#> 
#> [[1]]$sort
#> [[1]]$sort[[1]]
#> [1] "act"
#> 
#> 
#> 
#> [[2]]
#> [[2]]$`_index`
#> [1] "shakespeare"
#> 
#> [[2]]$`_type`
#> [1] "act"
#> 
#> [[2]]$`_id`
#> [1] "2633"
#> 
#> [[2]]$`_version`
#> [1] 1
#> 
#> [[2]]$`_score`
#> NULL
#> 
#> [[2]]$`_source`
#> [[2]]$`_source`$line_id
#> [1] 2634
#> 
#> [[2]]$`_source`$play_name
#> [1] "Henry IV"
#> 
#> [[2]]$`_source`$speech_number
#> [1] 9
#> 
#> [[2]]$`_source`$line_number
#> [1] ""
#> 
#> [[2]]$`_source`$speaker
#> [1] "ARCHBISHOP OF YORK"
#> 
#> [[2]]$`_source`$text_entry
#> [1] "ACT V"
#> 
#> 
#> [[2]]$sort
#> [[2]]$sort[[1]]
#> [1] "act"

Paging

Search(index="shakespeare", size=1, from=1, fields='text_entry')$hits
#> $total
#> [1] 5000
#> 
#> $max_score
#> [1] 1
#> 
#> $hits
#> $hits[[1]]
#> $hits[[1]]$`_index`
#> [1] "shakespeare"
#> 
#> $hits[[1]]$`_type`
#> [1] "line"
#> 
#> $hits[[1]]$`_id`
#> [1] "9"
#> 
#> $hits[[1]]$`_version`
#> [1] 1
#> 
#> $hits[[1]]$`_score`
#> [1] 1
#> 
#> $hits[[1]]$fields
#> $hits[[1]]$fields$text_entry
#> $hits[[1]]$fields$text_entry[[1]]
#> [1] "Nor more shall trenching war channel her fields,"

Queries

Using the q parameter you can pass in a query, which gets passed in the URI of the query. This type of query is less powerful than the below query passed in the body of the request, using the body parameter.

Search(index="shakespeare", type="act", q="speaker:KING HENRY IV")$hits$total
#> [1] 9

Query DSL searches - queries sent in the body of the request

Pass in as an R list

aggs <- list(aggs = list(stats = list(terms = list(field = "text_entry"))))
Search(index="shakespeare", body=aggs)$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Or pass in as json query with newlines, easy to read

aggs <- '{
    "aggs": {
        "stats" : {
            "terms" : {
                "field" : "text_entry"
            }
        }
    }
}'
Search(index="shakespeare", body=aggs)$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Or pass in collapsed json string

aggs <- '{"aggs":{"stats":{"terms":{"field":"text_entry"}}}}'
Search(index="shakespeare", body=aggs)$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Fuzzy query

Fuzzy query on numerics

fuzzy <- list(query = list(fuzzy = list(speech_number = 7)))
Search(index="shakespeare", body=fuzzy)$hits$total
#> [1] 523
fuzzy <- list(query = list(fuzzy = list(speech_number = list(value = 7, fuzziness = 4))))
Search(index="shakespeare", body=fuzzy)$hits$total
#> [1] 1499

Range query

With numeric

body <- list(query=list(range=list(decimalLongitude=list(gte=1, lte=3))))
Search('gbif', body=body)$hits$total
#> [1] 24
body <- list(query=list(range=list(decimalLongitude=list(gte=2.9, lte=10))))
Search('gbif', body=body)$hits$total
#> [1] 166

With dates

body <- list(query=list(range=list(eventDate=list(gte="2012-01-01", lte="now"))))
Search('gbif', body=body)$hits$total
#> [1] 899
body <- list(query=list(range=list(eventDate=list(gte="2014-01-01", lte="now"))))
Search('gbif', body=body)$hits$total
#> [1] 685

Highlighting

body <- '{
 "query": {
   "query_string": {
     "query" : "cell"
   }
 },
 "highlight": {
   "fields": {
     "title": {"number_of_fragments": 2}
   }
 }
}'
out <- Search('plos', 'article', body=body)
out$hits$total
#> [1] 57
sapply(out$hits$hits, function(x) x$highlight$title[[1]])[8:10]
#> [1] "c-FLIP Protects Eosinophils from TNF-α-Mediated <em>Cell</em> Death In Vivo"                          
#> [2] "DUSP1 Is a Novel Target for Enhancing Pancreatic Cancer <em>Cell</em> Sensitivity to Gemcitabine"     
#> [3] "Carbon Ion Radiation Inhibits Glioma and Endothelial <em>Cell</em> Migration Induced by Secreted VEGF"

Scrolling search - instead of paging

Search('shakespeare', q="a*")$hits$total
#> [1] 2747
res <- Search(index = 'shakespeare', q="a*", scroll="1m")
res <- Search(index = 'shakespeare', q="a*", scroll="1m", search_type = "scan")
length(scroll(scroll_id = res$`_scroll_id`)$hits$hits)
#> [1] 50
res <- Search(index = 'shakespeare', q="a*", scroll="5m", search_type = "scan")
out <- list()
hits <- 1
while(hits != 0){
  res <- scroll(scroll_id = res$`_scroll_id`)
  hits <- length(res$hits$hits)
  if(hits > 0)
    out <- c(out, res$hits$hits)
}
length(out)
#> [1] 2747

Woohoo! Collected all 2747 documents in very little time.

The cat API

List cat methods

cat_()
#> =^.^=
#> /_cat/allocation
#> /_cat/shards
#> /_cat/shards/{index}
#> /_cat/master
#> /_cat/nodes
#> /_cat/indices
#> /_cat/indices/{index}
#> /_cat/segments
#> /_cat/segments/{index}
#> /_cat/count
#> /_cat/count/{index}
#> /_cat/recovery
#> /_cat/recovery/{index}
#> /_cat/health
#> /_cat/pending_tasks
#> /_cat/aliases
#> /_cat/aliases/{alias}
#> /_cat/thread_pool
#> /_cat/plugins
#> /_cat/fielddata
#> /_cat/fielddata/{fields}

Get aliases

cat_aliases()
#> things plos - - - 
#> stuff  plos - - -

Get indices

cat_indices()
#> yellow open plosmore     5 1  1000  0   3.5mb   3.5mb 
#> yellow open leotheadfadf 5 1     0  0    575b    575b 
#> red    open alsothat     3 2                          
#> yellow open gbif         5 1   899  0     1mb     1mb 
#> yellow open gbifgeopoint 5 1     0  0    575b    575b 
#> yellow open gbifnewgeo   5 1     2  0   5.8kb   5.8kb 
#> yellow open plos         5 1  1202 39  14.2mb  14.2mb 
#> yellow open leothedog    5 1     0  0    575b    575b 
#> yellow open shakespeare  5 1  5000  0     1mb     1mb 
#> yellow open gbifgeo      5 1   600  0 861.9kb 861.9kb 
#> yellow open plosbigdata  5 1 20000  0  53.6mb  53.6mb 
#> yellow open mapuris      5 1    31  0  34.4kb  34.4kb 
#> yellow open leothelion   5 1     0  0    575b    575b

Get nodes

cat_nodes()
#> Scotts-MacBook-Pro.local 192.168.1.104 6 79 3.44 d * Hellfire

Work with indices

out <- index_get(index='shakespeare')
names(out$shakespeare$mappings)
#> [1] "line"  "scene" "act"

Check for index existence

index_exists(index='shakespeare')
#> [1] TRUE

Delete an index

index_delete(index='plos')
#> $acknowledged
#> [1] TRUE

Create an index

index_create(index='twitter')
#> $acknowledged
#> [1] TRUE

Work with documents

Get a document

docs_get(index='shakespeare', type='line', id=10)
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "10"
#> 
#> $`_version`
#> [1] 1
#> 
#> $found
#> [1] TRUE
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 11
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.8"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Nor bruise her flowerets with the armed hoofs"

Get certain fields

docs_get(index='shakespeare', type='line', id=10, fields=c('play_name','speaker'))
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "10"
#> 
#> $`_version`
#> [1] 1
#> 
#> $found
#> [1] TRUE
#> 
#> $fields
#> $fields$play_name
#> $fields$play_name[[1]]
#> [1] "Henry IV"
#> 
#> 
#> $fields$speaker
#> $fields$speaker[[1]]
#> [1] "KING HENRY IV"

Test for existence of the document

docs_get(index='plos', type='article', id=1, exists=TRUE)
#> [1] FALSE
docs_get(index='plos', type='article', id=123456, exists=TRUE)
#> [1] FALSE

Thats it

Let us know if you have any feedback!

binomen - taxonomic classes and parsing

I maintain, along with other awesome people, the taxize R package - a taxonomic toolbelt for R, for interacting with taxonomic data sources on the web.

Taxonomy data is not standardized, but there are a lot of common elements, and there is a finite list of taxonomic ranks, and finite number of major taxonomic data sets. Thus, I've been interested in attempting to define a pseudo standard for expressing taxonomic data in R. The conversation started a while back in a GitHub issue, and hasn't moved very far.

I decided to start playing with this more, which is easier to do in a separate pacakge. Thus: binomen. It's an attempt to define a set of taxonomic classes/objects in R, along with a suite of functions to help construct and parse these objects.

Would love any/all feedback.

Here's some examples:

Install

Install binomen

install.packages("devtools")
devtools::install_github("ropensci/binomen")
library("binomen")

Make a taxon

Make a taxon object

(obj <- make_taxon(genus="Poa", epithet="annua", authority="L.",
                   family='Poaceae', clazz='Poales', 
                   kingdom='Plantae', variety='annua'))
#> <taxon>
#>   binomial: Poa annua
#>   classification: 
#>     kingdom: Plantae
#>     clazz: Poales
#>     family: Poaceae
#>     genus: Poa
#>     species: Poa annua
#>     variety: annua

Index to various parts of the object

The binomial

obj$binomial
#> <binomial>
#>   genus: Poa
#>   epithet: annua
#>   canonical: Poa annua
#>   species: Poa annua L.
#>   authority: L.

The authority

obj$binomial$authority
#> [1] "L."

The classification

obj$classification
#> <classification>
#>     kingdom: Plantae
#>     clazz: Poales
#>     family: Poaceae
#>     genus: Poa
#>     species: Poa annua
#>     variety: annua

The family

obj$classification$family
#> <taxonref>
#>   rank: family
#>   name: Poaceae
#>   id: none
#>   uri: none

Subset taxon objects

Get a single rank

obj %>% select(family)
#> <taxonref>
#>   rank: family
#>   name: Poaceae
#>   id: none
#>   uri: none

Get a range of ranks

obj %>% range(kingdom, family)
#> $kingdom
#> <taxonref>
#>   rank: kingdom
#>   name: Plantae
#>   id: none
#>   uri: none
#> 
#> $clazz
#> <taxonref>
#>   rank: clazz
#>   name: Poales
#>   id: none
#>   uri: none
#> 
#> $family
#> <taxonref>
#>   rank: family
#>   name: Poaceae
#>   id: none
#>   uri: none

Extract classification as a data.frame

gethier(obj)
#>      rank      name
#> 1 kingdom   Plantae
#> 2   clazz    Poales
#> 3  family   Poaceae
#> 4   genus       Poa
#> 5 species Poa annua
#> 6 variety     annua

Taxonomic data.frame's

Make one

df <- data.frame(
  order=c('Asterales','Asterales','Fagales','Poales','Poales','Poales'),
  family=c('Asteraceae','Asteraceae','Fagaceae','Poaceae','Poaceae','Poaceae'),
  genus=c('Helianthus','Helianthus','Quercus','Poa','Festuca','Holodiscus'),
  stringsAsFactors = FALSE)
(df2 <- taxon_df(df))
#>       order     family      genus
#> 1 Asterales Asteraceae Helianthus
#> 2 Asterales Asteraceae Helianthus
#> 3   Fagales   Fagaceae    Quercus
#> 4    Poales    Poaceae        Poa
#> 5    Poales    Poaceae    Festuca
#> 6    Poales    Poaceae Holodiscus

Parse - get rank order matching Fagales

df2 %>% select(order, Fagales)
#>     order   family   genus
#> 3 Fagales Fagaceae Quercus

get rank family matching Asteraceae

df2 %>% select(family, Asteraceae)
#>       order     family      genus
#> 1 Asterales Asteraceae Helianthus
#> 2 Asterales Asteraceae Helianthus

get rank genus matching Poa

df2 %>% select(genus, Poa)
#>    order  family genus
#> 4 Poales Poaceae   Poa