Recology


request - a high level HTTP client for R

request is a DSL for HTTP requests in R, inspired by the CLI tool httpie. It's built on httr.

The following were driving principles for this package:

  • The web is increasingly a JSON world, so we assume application/json by default, but give back other types if the server returns them
  • The workflow should follow logically: start with a URL, add options, then execute the request - and functions support piping so you can compose them in that order
  • Whenever possible, we transform output to data.frames, facilitating downstream manipulation via dplyr, etc.
  • We do GET requests by default; specify a different verb if you don't want GET. Given that default, this client is optimized for consuming data rather than creating new things on servers
  • You can use non-standard evaluation to easily pass in query parameters without worrying about &'s, URL escaping, etc. (see api_query())
  • Same for body params (see api_body())
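
As a quick sketch of those last two principles (using peep(), covered below, so nothing is actually sent over the wire; the httpbin URL and parameters are just for illustration):

```r
library("request")

# NSE: bare names become query parameters; &'s and URL escaping are handled
api("http://api.plos.org/search") %>%
  api_query(q = ecology, wt = json) %>%
  peep()

# body parameters work the same way via api_body()
api("http://httpbin.org/post") %>%
  api_body(title = stuff, year = 2016) %>%
  peep()
```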

The following is a brief demo of some of the package functionality:

Install

From CRAN

install.packages("request")

Or from GitHub

devtools::install_github("sckott/request")
library("request")

Execute on last pipe

When using pipes (%>%) in request, we autodetect the last piped command and execute http() if it's the last one. If it's not the last, the output gets passed on to the next command, and so on. This feature (and magrittr itself) is the work of Stefan Milton Bache.

This feature is really nice because a) it's one less thing you need to do, and b) you only need to care about the request itself.

You can avoid auto-execution by using peep(), which prints a summary of the request you've created but does not execute an HTTP request.

HTTP Requests

A high-level function http() wraps a lower-level R6 object, RequestIterator, which holds a series of variables and functions to execute GET and POST requests, and will support other HTTP verbs as well. In addition, it can hold state, which will allow us to do paging internally for you (see below). You get direct access to the R6 object if you call http_client() instead of http().
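
A minimal sketch of getting at that object (exactly which fields and methods RequestIterator exposes isn't shown here, so treat this as illustrative):

```r
library("request")

# http() would execute the request; http_client() instead hands back the
# RequestIterator R6 object, so you can hold on to state yourself
cli <- api('https://api.github.com/') %>% http_client()
cli
```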

NSE and SE

Most, if not all, functions in request support non-standard evaluation (NSE) as well as standard evaluation (SE). If a function supports both, the version without an underscore is for NSE, while the version with an underscore is for SE. For example, here we make an HTTP request by passing a base URL, then a series of paths that get combined together. First, the NSE version:

api('https://api.github.com/') %>%
  api_path(repos, ropensci, rgbif, issues)

Then the SE version:

api('https://api.github.com/') %>%
  api_path_('repos', 'ropensci', 'rgbif', 'issues')

Building API routes

The first thing you'll want to do is lay out the base URL for your request. The function api() is your friend.

api() works with full or partial URLs:

api('https://api.github.com/')
#> URL: https://api.github.com/
api('http://api.gbif.org/v1')
#> URL: http://api.gbif.org/v1
api('api.gbif.org/v1')
#> URL: api.gbif.org/v1

And it works with ports, full or partial:

api('http://localhost:9200')
#> URL: http://localhost:9200
api('localhost:9200')
#> URL: http://localhost:9200
api(':9200')
#> URL: http://localhost:9200
api('9200')
#> URL: http://localhost:9200
api('9200/stuff')
#> URL: http://localhost:9200/stuff

Make HTTP requests

The above examples with api() are not passed through a pipe, so they only define a URL and don't make an HTTP request. To make one, either pipe a URL or partial URL to api(), or call http() at the end of a string of function calls:

'https://api.github.com/' %>% api()
#> $current_user_url
#> [1] "https://api.github.com/user"
#> 
#> $current_user_authorizations_html_url
#> [1] "https://github.com/settings/connections/applications{/client_id}"
#> 
#> $authorizations_url
#> [1] "https://api.github.com/authorizations"
#> 
#> $code_search_url
...

Or

api('https://api.github.com/') %>% http()
#> $current_user_url
#> [1] "https://api.github.com/user"
#> 
#> $current_user_authorizations_html_url
#> [1] "https://github.com/settings/connections/applications{/client_id}"
#> 
#> $authorizations_url
#> [1] "https://api.github.com/authorizations"
#> 
#> $code_search_url
...

http() is called automatically at the end of a chain of piped commands, so there's no need to invoke it yourself. However, you can if you like.

Templating

repo_info <- list(username = 'craigcitro', repo = 'r-travis')
api('https://api.github.com/') %>%
  api_template(template = 'repos/{{username}}/{{repo}}/issues', data = repo_info)
#> [[1]]
#> [[1]]$url
#> [1] "https://api.github.com/repos/craigcitro/r-travis/issues/164"
#> 
#> [[1]]$labels_url
#> [1] "https://api.github.com/repos/craigcitro/r-travis/issues/164/labels{/name}"
#> 
#> [[1]]$comments_url
#> [1] "https://api.github.com/repos/craigcitro/r-travis/issues/164/comments"
#> ...

Set paths

api_path() adds paths to the base URL

api('https://api.github.com/') %>%
  api_path(repos, ropensci, rgbif, issues) %>%
  peep
#> <http request> 
#>   url: https://api.github.com/
#>   paths: repos/ropensci/rgbif/issues
#>   query: 
#>   body: 
#>   paging: 
#>   headers: 
#>   rate limit: 
#>   retry (n/delay (s)): /
#>   error handler: 
#>   config:

Query

api("http://api.plos.org/search") %>%
  api_query(q = ecology, wt = json, fl = journal) %>%
  peep
#> <http request> 
#>   url: http://api.plos.org/search
#>   paths: 
#>   query: q=ecology, wt=json, fl=journal
#>   body: 
#>   paging: 
#>   headers: 
#>   rate limit: 
#>   retry (n/delay (s)): /
#>   error handler: 
#>   config:

Headers

api('http://httpbin.org/headers') %>%
  api_headers(`X-FARGO-SEASON` = 3, `X-NARCOS-SEASON` = 5) %>%
  peep
#> <http request> 
#>   url: http://httpbin.org/headers
#>   paths: 
#>   query: 
#>   body: 
#>   paging: 
#>   headers: 
#>     X-FARGO-SEASON: 3
#>     X-NARCOS-SEASON: 5
#>   rate limit: 
#>   retry (n/delay (s)): /
#>   error handler: 
#>   config:

curl configuration

httr is re-exported in request, so you can use httr functions like verbose() to get verbose curl output:

api('http://httpbin.org/headers') %>%
  api_config(verbose())
#> -> GET /headers HTTP/1.1
#> -> Host: httpbin.org
#> -> User-Agent: curl/7.43.0 curl/0.9.4 httr/1.0.0 request/0.1.0
#> -> Accept-Encoding: gzip, deflate
#> -> Accept: application/json, text/xml, application/xml, */*
#> ->
#> <- HTTP/1.1 200 OK
#> <- Server: nginx
#> <- Date: Sun, 03 Jan 2016 16:56:29 GMT
#> <- Content-Type: application/json
#> <- Content-Length: 227
#> <- Connection: keep-alive
#> <- Access-Control-Allow-Origin: *
#> <- Access-Control-Allow-Credentials: true
#> <-
#> $headers
#> $headers$Accept
#> [1] "application/json, text/xml, application/xml, */*"
#> ...

Coming soon

There are a number of interesting features that should be coming soon to request.

  • Paging - a paging helper will make it easy to do paging, and will attempt to handle any parameters used for paging. Some user input will be required, such as what the parameter names are and how many records you want returned sckott/request#2
  • Retry - a retry helper will make it easy to retry HTTP requests on failure, and execute a user-defined function on failure sckott/request#6
  • Rate limit - a rate limit helper will add rate-limit info to a set of many requests - still in early design stages sckott/request#5
  • Caching - a caching helper - may use in the background the in-development vcr R client once it's on CRAN, or perhaps storr sckott/request#4

binomen - Tools for slicing and dicing taxonomic names

The first version of binomen is now up on CRAN. It provides various taxonomic classes for defining a single taxon, multiple taxa, and a taxonomic data.frame. It is designed as a companion to taxize, which fetches taxonomic data on names from the web.

The classes (S3):

  • taxon
  • taxonref
  • taxonrefs
  • binomial
  • grouping (i.e., classification - used different term to avoid conflict with classification in taxize)

For example, the binomial class is defined by a genus, epithet, authority, and optional full species name and canonical version.

binomial("Poa", "annua", authority="L.")
<binomial>
  genus: Poa
  epithet: annua
  canonical:
  species:
  authority: L.

The package has a suite of functions to work on these taxonomic classes:

  • gethier() - get hierarchy from a taxon class
  • scatter() - make each row in a taxonomic data.frame (taxondf) a separate taxon object within a single taxa object
  • assemble() - make a taxa object into a taxondf data.frame
  • pick() - pick out one or more taxonomic groups
  • pop() - pop out (drop) one or more taxonomic groups
  • span() - pick a range between two taxonomic groups (inclusive)
  • strain() - filter by taxonomic groups, like dplyr's filter
  • name() - get the taxon name for each taxonref object
  • uri() - get the reference uri for each taxonref object
  • rank() - get the taxonomic rank for each taxonref object
  • id() - get the taxonomic identifier for each taxonref object
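
As one example, strain() filters a taxonomic data.frame much like dplyr's filter(); a sketch (the exact filter-expression syntax here is an assumption):

```r
library("binomen")

df <- data.frame(
  order = c('Asterales', 'Fagales', 'Poales'),
  family = c('Asteraceae', 'Fagaceae', 'Poaceae'),
  genus = c('Helianthus', 'Quercus', 'Poa'),
  stringsAsFactors = FALSE)
df2 <- taxon_df(df)

# keep only rows in the order Poales
df2 %>% strain(order == "Poales")
```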

The approach in this package is somewhat like split-apply-combine from plyr/dplyr, but it aims to make that easy to do with taxonomic names.

Install

For the examples below, you'll need the CRAN version:

install.packages("binomen")
library("binomen")

Make a taxon

Make a taxon object

(obj <- make_taxon(genus="Poa", epithet="annua", authority="L.",
  family='Poaceae', clazz='Poales', kingdom='Plantae', variety='annua'))
#> <taxon>
#>   binomial: Poa annua
#>   grouping: 
#>     kingdom: Plantae
#>     clazz: Poales
#>     family: Poaceae
#>     genus: Poa
#>     species: Poa annua
#>     variety: annua

Index to various parts of the object

The binomial

obj$binomial
#> <binomial>
#>   genus: Poa
#>   epithet: annua
#>   canonical: Poa annua
#>   species: Poa annua L.
#>   authority: L.

The authority

obj$binomial$authority
#> [1] "L."

The classification

obj$grouping
#> <grouping>
#>   kingdom: Plantae
#>   clazz: Poales
#>   family: Poaceae
#>   genus: Poa
#>   species: Poa annua
#>   variety: annua

The family

obj$grouping$family
#> <taxonref>
#>   rank: family
#>   name: Poaceae
#>   id: none
#>   uri: none

Subset taxon objects

Get one or more ranks via pick()

obj %>% pick(family)
#> <taxon>
#>   binomial: Poa annua
#>   grouping: 
#>     family: Poaceae
obj %>% pick(family, genus)
#> <taxon>
#>   binomial: Poa annua
#>   grouping: 
#>     family: Poaceae
#>     genus: Poa

Drop one or more ranks via pop()

obj %>% pop(family)
#> <taxon>
#>   binomial: Poa annua
#>   grouping: 
#>     kingdom: Plantae
#>     clazz: Poales
#>     genus: Poa
#>     species: Poa annua
#>     variety: annua
obj %>% pop(family, genus)
#> <taxon>
#>   binomial: Poa annua
#>   grouping: 
#>     kingdom: Plantae
#>     clazz: Poales
#>     species: Poa annua
#>     variety: annua

Get a range of ranks via span()

obj %>% span(kingdom, family)
#> <taxon>
#>   binomial: Poa annua
#>   grouping: 
#>     kingdom: Plantae
#>     clazz: Poales
#>     family: Poaceae

Extract classification as a data.frame

gethier(obj)
#>      rank      name
#> 1 kingdom   Plantae
#> 2   clazz    Poales
#> 3  family   Poaceae
#> 4   genus       Poa
#> 5 species Poa annua
#> 6 variety     annua

Taxonomic data.frames

Make one

df <- data.frame(order = c('Asterales','Asterales','Fagales','Poales','Poales','Poales'),
  family = c('Asteraceae','Asteraceae','Fagaceae','Poaceae','Poaceae','Poaceae'),
  genus = c('Helianthus','Helianthus','Quercus','Poa','Festuca','Holodiscus'),
  stringsAsFactors = FALSE)
(df2 <- taxon_df(df))
#>       order     family      genus
#> 1 Asterales Asteraceae Helianthus
#> 2 Asterales Asteraceae Helianthus
#> 3   Fagales   Fagaceae    Quercus
#> 4    Poales    Poaceae        Poa
#> 5    Poales    Poaceae    Festuca
#> 6    Poales    Poaceae Holodiscus

Parse - get rank order via pick()

df2 %>% pick(order)
#>       order
#> 1 Asterales
#> 2 Asterales
#> 3   Fagales
#> 4    Poales
#> 5    Poales
#> 6    Poales

get ranks order, family, and genus via pick()

df2 %>% pick(order, family, genus)
#>       order     family      genus
#> 1 Asterales Asteraceae Helianthus
#> 2 Asterales Asteraceae Helianthus
#> 3   Fagales   Fagaceae    Quercus
#> 4    Poales    Poaceae        Poa
#> 5    Poales    Poaceae    Festuca
#> 6    Poales    Poaceae Holodiscus

get range of names via span(), from rank X to rank Y

df2 %>% span(family, genus)
#>       family      genus
#> 1 Asteraceae Helianthus
#> 2 Asteraceae Helianthus
#> 3   Fagaceae    Quercus
#> 4    Poaceae        Poa
#> 5    Poaceae    Festuca
#> 6    Poaceae Holodiscus

Separate each row into a taxon object (many taxon objects together make a taxa object)

scatter(df2)
#> [[1]]
#> <taxon>
#>   binomial: Helianthus none
#>   grouping: 
#>     order: Asterales
#>     family: Asteraceae
#>     genus: Helianthus
#>     species: Helianthus none
#> 
#> [[2]]
#> <taxon>
#>   binomial: Helianthus none
#>   grouping: 
#>     order: Asterales
#>     family: Asteraceae
#>     genus: Helianthus
#>     species: Helianthus none
#> 
#> [[3]]
#> <taxon>
#>   binomial: Quercus none
#>   grouping: 
#>     order: Fagales
#>     family: Fagaceae
#>     genus: Quercus
#>     species: Quercus none
#> 
#> [[4]]
#> <taxon>
#>   binomial: Poa none
#>   grouping: 
#>     order: Poales
#>     family: Poaceae
#>     genus: Poa
#>     species: Poa none
#> 
#> [[5]]
#> <taxon>
#>   binomial: Festuca none
#>   grouping: 
#>     order: Poales
#>     family: Poaceae
#>     genus: Festuca
#>     species: Festuca none
#> 
#> [[6]]
#> <taxon>
#>   binomial: Holodiscus none
#>   grouping: 
#>     order: Poales
#>     family: Poaceae
#>     genus: Holodiscus
#>     species: Holodiscus none
#> 
#> attr(,"class")
#> [1] "taxa"

And you can re-assemble a data.frame from the output of scatter() with assemble()

out <- scatter(df2)
assemble(out)
#>       order     family      genus         species
#> 1 Asterales Asteraceae Helianthus Helianthus none
#> 2 Asterales Asteraceae Helianthus Helianthus none
#> 3   Fagales   Fagaceae    Quercus    Quercus none
#> 4    Poales    Poaceae        Poa        Poa none
#> 5    Poales    Poaceae    Festuca    Festuca none
#> 6    Poales    Poaceae Holodiscus Holodiscus none

Thoughts?

I'm really curious what people think of binomen. I'm not sure how useful this will be in the wild. Try it. Let me know. Thanks much :)

Crossref programmatic clients

I gave two talks recently at the annual Crossref meeting, one of which was a somewhat technical overview of programmatic clients for Crossref APIs. Check out the talk here. I talked about the motivation for working with Crossref data by writing code/etc. rather than going the GUI route, then went over the various clients, with brief examples.

We (rOpenSci) have been working on the R client rcrossref for a while now, but I'm also working on the Python and Ruby clients for Crossref. In addition, the Ruby client includes a CLI. The Javascript client is developed independently by ScienceAI.

The R, Ruby, and Python clients are usable but not yet feature complete, and would benefit from lots of users surfacing bugs and highlighting nice-to-have features.

The main Crossref API used in all the clients is documented at api.crossref.org.

I've tried to make the APIs similar-ish across clients. Functions in each client match the main Crossref search API (api.crossref.org) routes:

  • /works
  • /members
  • /funders
  • /journals
  • /types
  • /licenses
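
In rcrossref, for example, these routes map onto similarly named functions (a sketch; check each client's docs for the full signatures):

```r
library("rcrossref")

cr_works(query = "ecology", limit = 3)  # /works
cr_members(limit = 3)                   # /members
cr_journals(limit = 3)                  # /journals
# and similarly cr_funders(), cr_types(), cr_licenses()
```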

Other methods in all three clients:

  • Get DOI minting agency
    • Uses api.crossref.org API
  • Get random DOIs
    • Uses api.crossref.org API
  • Content negotiation
  • Get full text
    • other clients in each language will focus on this use case
  • Get citation count
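
For instance, citation counts in the R client (the DOI below is just an illustrative example):

```r
library("rcrossref")

# number of times this DOI has been cited, per Crossref
cr_citation_count(doi = "10.1371/journal.pone.0042793")
```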

The following shows how to install, and then examples from each client for a few use cases.

Installation

Python

pip install habanero

Ruby

gem install serrano

R

Inside R:

install.packages("rcrossref")

Javascript

npm install crossref

I won't do any examples with the js library, as I don't maintain it.

Use case: get ORCID IDs for authors

Python

from habanero import Crossref
cr = Crossref()
res = cr.works(filter = {'has_orcid': True}, limit = 10)
res2 = [ [ z.get('ORCID') for z in x['author'] ] for x in res.result['message']['items'] ]
filter(None, reduce(lambda x, y: x+y, res2))
[u'http://orcid.org/0000-0003-4087-8021',
 u'http://orcid.org/0000-0002-2076-5452',
 u'http://orcid.org/0000-0003-4087-8021',
 u'http://orcid.org/0000-0002-2076-5452',
 u'http://orcid.org/0000-0003-1710-1580',
 u'http://orcid.org/0000-0003-1710-1580',
 u'http://orcid.org/0000-0003-4637-238X',
 u'http://orcid.org/0000-0003-4637-238X',
 u'http://orcid.org/0000-0003-4637-238X',
 u'http://orcid.org/0000-0003-4637-238X',
 u'http://orcid.org/0000-0003-4637-238X',
 u'http://orcid.org/0000-0003-2510-4271']

Ruby

require 'serrano'
res = Serrano.works(filter: {'has_orcid': true}, limit: 10)
res2 = res['message']['items'].collect { |x| x['author'].collect { |z| z['ORCID'] } }
res2.flatten.compact
=> ["http://orcid.org/0000-0003-4087-8021",
 "http://orcid.org/0000-0002-2076-5452",
 "http://orcid.org/0000-0003-4087-8021",
 "http://orcid.org/0000-0002-2076-5452",
 "http://orcid.org/0000-0003-1710-1580",
 "http://orcid.org/0000-0003-1710-1580",
 "http://orcid.org/0000-0003-4637-238X",
 "http://orcid.org/0000-0003-4637-238X",
 "http://orcid.org/0000-0003-4637-238X",
 "http://orcid.org/0000-0003-4637-238X",
 "http://orcid.org/0000-0003-4637-238X",
 "http://orcid.org/0000-0003-2510-4271"]

R

library("rcrossref")
res <- cr_works(filter=c(has_orcid=TRUE), limit = 10)
orcids <- unlist(lapply(res$data$author, function(z) z$ORCID))
Filter(function(x) !is.na(x), orcids)
 [1] "http://orcid.org/0000-0003-4087-8021"
 [2] "http://orcid.org/0000-0002-2076-5452"
 [3] "http://orcid.org/0000-0003-4087-8021"
 [4] "http://orcid.org/0000-0002-2076-5452"
 [5] "http://orcid.org/0000-0003-1710-1580"
 [6] "http://orcid.org/0000-0003-1710-1580"
 [7] "http://orcid.org/0000-0003-4637-238X"
 [8] "http://orcid.org/0000-0003-4637-238X"
 [9] "http://orcid.org/0000-0003-4637-238X"
[10] "http://orcid.org/0000-0003-4637-238X"
[11] "http://orcid.org/0000-0003-4637-238X"
[12] "http://orcid.org/0000-0003-2510-4271"

CLI

serrano works --filter=has_orcid:true --json --limit=12 | jq '.message.items[].author[].ORCID | select(. != null)'
"http://orcid.org/0000-0003-4087-8021"
"http://orcid.org/0000-0002-2076-5452"
"http://orcid.org/0000-0003-4087-8021"
"http://orcid.org/0000-0002-2076-5452"
"http://orcid.org/0000-0003-1710-1580"
"http://orcid.org/0000-0003-1710-1580"
"http://orcid.org/0000-0003-4637-238X"
"http://orcid.org/0000-0003-4637-238X"
"http://orcid.org/0000-0003-4637-238X"
"http://orcid.org/0000-0003-4637-238X"
"http://orcid.org/0000-0003-4637-238X"
"http://orcid.org/0000-0003-2510-4271"
"http://orcid.org/0000-0001-9408-8207"
"http://orcid.org/0000-0002-2076-5452"

Use case: content negotiation

Python

from habanero import cn
cn.content_negotiation(ids = '10.1126/science.169.3946.635', format = "text")
u'Frank, H. S. (1970). The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance. Science, 169(3946), 635\xe2\x80\x93641. doi:10.1126/science.169.3946.635\n'

Ruby

require 'serrano'
Serrano.content_negotiation(ids: '10.1126/science.169.3946.635', format: "text")
=> ["Frank, H. S. (1970). The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance. Science, 169(3946), 635\xE2\x80\x93641. doi:10.1126/science.169.3946.635\n"]

R

library("rcrossref")
cr_cn(dois="10.1126/science.169.3946.635", "text")
[1] "Frank, H. S. (1970). The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance. Science, 169(3946), 635–641. doi:10.1126/science.169.3946.635"

CLI

serrano contneg 10.1890/13-0590.1 --format=text
Murtaugh, P. A. (2014).  In defense of P values . Ecology, 95(3), 611–617. doi:10.1890/13-0590.1

More

There are definitely issues with data in the Crossref search API, some of which I cover in my talks. However, it is still the best place to go for scholarly metadata.

Let us know of other use cases - there are others not covered here for brevity's sake.

There are lots of examples in the docs for each client. If you can think of any doc improvements, file an issue.

If you find any bugs, please do file an issue.