Recology

R/etc.


Faster solr with csv

With the help of user input, I've tweaked solr a bit to make things faster with default settings. I imagine the main interface for most people using the solr R client is solr_search(), which used to have wt=json by default. Changing this to wt=csv gives better performance. And it makes sense to use csv: the point of using an R client is probably to get data into a data.frame eventually, so going with csv (already tabular) makes sense if it's faster too.

Install

Install and load solr

devtools::install_github("ropensci/solr")
library("solr")
library("microbenchmark")

Setup

Define base url and fields to return

url <- 'http://api.plos.org/search'
fields <- c('id','cross_published_journal_name','cross_published_journal_key',
            'cross_published_journal_eissn','pmid','pmcid','publisher','journal',
            'publication_date','article_type','article_type_facet','author',
            'author_facet','volume','issue','elocation_id','author_display',
            'competing_interest','copyright')

json

The previous default for solr_search() used json

solr_search(q='*:*', rows=10, fl=fields, base=url, wt = "json")
#> Source: local data frame [10 x 19]
#> 
#>                                                                    id
#> 1             10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4
#> 2       10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/title
#> 3    10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/abstract
#> 4  10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/references
#> 5        10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/body
#> 6             10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525
#> 7       10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/title
#> 8    10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/abstract
#> 9  10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/references
#> 10       10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/body
#> Variables not shown: cross_published_journal_name (chr),
#>   cross_published_journal_key (chr), cross_published_journal_eissn (chr),
#>   pmid (chr), pmcid (chr), publisher (chr), journal (chr),
#>   publication_date (chr), article_type (chr), article_type_facet (chr),
#>   author (chr), author_facet (chr), volume (int), issue (int),
#>   elocation_id (chr), author_display (chr), competing_interest (chr),
#>   copyright (chr)

csv

The default wt setting is now csv

solr_search(q='*:*', rows=10, fl=fields, base=url)
#> Source: local data frame [10 x 19]
#> 
#>                                                                    id
#> 1             10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4
#> 2       10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/title
#> 3    10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/abstract
#> 4  10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/references
#> 5        10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/body
#> 6             10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525
#> 7       10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/title
#> 8    10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/abstract
#> 9  10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/references
#> 10       10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/body
#> Variables not shown: cross_published_journal_name (chr),
#>   cross_published_journal_key (chr), cross_published_journal_eissn (chr),
#>   pmid (chr), pmcid (chr), publisher (chr), journal (chr),
#>   publication_date (chr), article_type (chr), article_type_facet (chr),
#>   author (chr), author_facet (chr), volume (int), issue (int),
#>   elocation_id (chr), author_display (chr), competing_interest (chr),
#>   copyright (chr)

Compare times

When parsing to a data.frame (which solr_search() does by default), csv is quite a bit faster.

microbenchmark(
  json = solr_search(q='*:*', rows=500, fl=fields, base=url, wt = "json", verbose = FALSE),
  csv = solr_search(q='*:*', rows=500, fl=fields, base=url, wt = "csv", verbose = FALSE), 
  times = 20
)
#> Unit: milliseconds
#>  expr      min       lq      mean    median        uq       max neval cld
#>  json 965.7043 1013.014 1124.1229 1086.3225 1227.9054 1441.8332    20   b
#>   csv 509.6573  520.089  541.5784  532.4546  548.0303  723.7575    20  a

json vs xml vs csv

When getting raw data, csv is fastest, json comes next, and xml pulls up the rear.

microbenchmark(
  json = solr_search(q='*:*', rows=1000, fl=fields, base=url, wt = "json", verbose = FALSE, raw = TRUE),
  csv = solr_search(q='*:*', rows=1000, fl=fields, base=url, wt = "csv", verbose = FALSE, raw = TRUE),
  xml = solr_search(q='*:*', rows=1000, fl=fields, base=url, wt = "xml", verbose = FALSE, raw = TRUE),
  times = 10
)
#> Unit: milliseconds
#>  expr       min       lq      mean    median        uq       max neval cld
#>  json 1110.9515 1142.478 1198.9981 1169.0808 1195.5709 1518.7412    10  b 
#>   csv  801.6871  802.516  826.0655  819.1532  835.0512  873.4266    10 a  
#>   xml 1507.1111 1554.002 1618.5963 1617.5208 1671.0026 1740.4448    10   c

Notes

Note that wt=csv is only available in solr_search() and solr_all(), because the csv writer returns only the docs element, dropping the other elements (facets, mlt, groups, stats, etc.).
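So if you need facets, groups, and so on, stick with wt=json (or xml). A small sketch, assuming solr_facet() takes parameters analogous to solr_search():

solr_facet(q = '*:*', facet.field = 'journal', base = url, wt = "json")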

Also, note that the HTTP client used in solr is httr, which sends a gzip compression header by default, so as long as the server serving up the Solr data has compression turned on, you're all set.
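You can check compression yourself; a minimal sketch with httr (the content-encoding header only shows up if the server actually compressed the response):

library("httr")
res <- GET("http://api.plos.org/search",
           query = list(q = "*:*", wt = "csv", rows = 1))
# "gzip" here means the server compressed the response
headers(res)$`content-encoding`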

I've also sped up the wt=json route: when parsing to a data.frame, solr now uses dplyr, which sped things up considerably.
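The idea, roughly (a toy sketch, not the package's exact internals):

library("dplyr")
# two toy docs standing in for parsed JSON from Solr
docs <- list(
  list(id = "a", journal = "PLOS ONE"),
  list(id = "b", journal = "PLOS Biology")
)
bind_rows(lapply(docs, as.data.frame, stringsAsFactors = FALSE))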

PUT dataframes on your couch

It would be nice to easily push each row or column of a data.frame into CouchDB instead of having to convert them to JSON yourself, then push them into couch. I recently added the ability to push data.frames into couch using the normal PUT /{db} method, and added support for the couch bulk API.

Install

install.packages("devtools")
devtools::install_github("sckott/sofa")
library("sofa")

PUT /db

You can write directly from a data.frame, either by rows or columns. First, rows:


Create a database

db_create(dbname="mtcarsdb")
#> $ok
#> [1] TRUE
out <- doc_create(mtcars, dbname="mtcarsdb", how="rows")
out[1:2]
#> $`Mazda RX4`
#> $`Mazda RX4`$ok
#> [1] TRUE
#> 
#> $`Mazda RX4`$id
#> [1] "0063109bfb1c15765854cbc9525c3a7a"
#> 
#> $`Mazda RX4`$rev
#> [1] "1-3946941c894a874697554e3e6d9bc176"
#> 
#> 
#> $`Mazda RX4 Wag`
#> $`Mazda RX4 Wag`$ok
#> [1] TRUE
#> 
#> $`Mazda RX4 Wag`$id
#> [1] "0063109bfb1c15765854cbc9525c461d"
#> 
#> $`Mazda RX4 Wag`$rev
#> [1] "1-273ff17a938cb956cba21051ab428b95"

Then by columns

out <- doc_create(mtcars, dbname="mtcarsdb", how="columns")
out[1:2]
#> $mpg
#> $mpg$ok
#> [1] TRUE
#> 
#> $mpg$id
#> [1] "0063109bfb1c15765854cbc9525d4f1f"
#> 
#> $mpg$rev
#> [1] "1-4b83d0ef53a28849a872d47ad03fef9a"
#> 
#> 
#> $cyl
#> $cyl$ok
#> [1] TRUE
#> 
#> $cyl$id
#> [1] "0063109bfb1c15765854cbc9525d57d3"
#> 
#> $cyl$rev
#> [1] "1-c21bfa5425c67743f0826fd4b44b0dbf"

Bulk API

The bulk API should be faster for larger data.frames

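For context, bulk loading maps onto CouchDB's POST /{db}/_bulk_docs endpoint, which sends all docs in a single request. A rough sketch of the equivalent raw call with httr + jsonlite (the localhost URL is an assumption):

library("httr")
library("jsonlite")
# each row becomes one JSON doc inside a {"docs": [...]} payload
payload <- list(docs = unname(apply(mtcars[1:2, ], 1, as.list)))
POST("http://localhost:5984/bulktest/_bulk_docs",
     body = toJSON(payload, auto_unbox = TRUE),
     content_type_json())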

We'll use part of the diamonds dataset

library("ggplot2")
dat <- diamonds[1:20000,]

Create a database

db_create(dbname="bulktest")
#> $ok
#> [1] TRUE

Load by row (you could instead load by column; see the how parameter), printing the time it takes

system.time(out <- bulk_create(dat, dbname="bulktest"))
#>    user  system elapsed 
#>  16.832   6.039  24.432

The returned data is the same as with doc_create()

out[1:2]
#> [[1]]
#> [[1]]$ok
#> [1] TRUE
#> 
#> [[1]]$id
#> [1] "0063109bfb1c15765854cbc9525d8b8d"
#> 
#> [[1]]$rev
#> [1] "1-f407fe4935af7fd17c101f13d3c81679"
#> 
#> 
#> [[2]]
#> [[2]]$ok
#> [1] TRUE
#> 
#> [[2]]$id
#> [1] "0063109bfb1c15765854cbc9525d998b"
#> 
#> [[2]]$rev
#> [1] "1-cf8b9a9dcdc026052a663d6fef8a36fe"

So that's 20,000 rows in about 24 seconds. Not bad.

not dataframes

You can also pass in lists, or JSON as character strings, e.g.,

lists

row.names(mtcars) <- NULL # get rid of row.names
lst <- parse_df(mtcars, tojson=FALSE)
db_create(dbname="bulkfromlist")
#> $ok
#> [1] TRUE
out <- bulk_create(lst, dbname="bulkfromlist")
out[1:2]
#> [[1]]
#> [[1]]$ok
#> [1] TRUE
#> 
#> [[1]]$id
#> [1] "ba70c46d73707662b1e204a90fcd9bb8"
#> 
#> [[1]]$rev
#> [1] "1-3946941c894a874697554e3e6d9bc176"
#> 
#> 
#> [[2]]
#> [[2]]$ok
#> [1] TRUE
#> 
#> [[2]]$id
#> [1] "ba70c46d73707662b1e204a90fcda9f6"
#> 
#> [[2]]$rev
#> [1] "1-273ff17a938cb956cba21051ab428b95"

json

strs <- as.character(parse_df(mtcars, "columns"))
db_create(dbname="bulkfromchr")
#> $ok
#> [1] TRUE
out <- bulk_create(strs, dbname="bulkfromchr")
out[1:2]
#> [[1]]
#> [[1]]$ok
#> [1] TRUE
#> 
#> [[1]]$id
#> [1] "ba70c46d73707662b1e204a90fce8c20"
#> 
#> [[1]]$rev
#> [1] "1-4b83d0ef53a28849a872d47ad03fef9a"
#> 
#> 
#> [[2]]
#> [[2]]$ok
#> [1] TRUE
#> 
#> [[2]]$id
#> [1] "ba70c46d73707662b1e204a90fce9bc1"
#> 
#> [[2]]$rev
#> [1] "1-c21bfa5425c67743f0826fd4b44b0dbf"

csl - an R client for Citation Style Language data

CSL (Citation Style Language) is now widely used to specify citations in a standard fashion. csl is an R client for exploring CSL styles, and is inspired by the Ruby gem csl. For example, CSL data is returned by the PLOS Lagotto article-level metrics API (see http://alm.plos.org/api/v5/articles?ids=10.1371%252Fjournal.pone.0025110&info=detail&source_id=crossref).
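If you want to peek at that response from R, here's a quick sketch with httr + jsonlite (using the URL above verbatim; requires the alm.plos.org service to be up):

library("httr")
library("jsonlite")
res <- GET("http://alm.plos.org/api/v5/articles?ids=10.1371%252Fjournal.pone.0025110&info=detail&source_id=crossref")
str(fromJSON(content(res, "text")), max.level = 2)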

Let me know if you have any feedback at the repo https://github.com/ropensci/csl

Install

install.packages("devtools")
devtools::install_github("ropensci/csl")
library("csl")

Load CSL style from a URL

You can load CSL styles from either a URL or a local file on your machine. First, from a URL; in this case, from the Zotero style repository, for the American Journal of Political Science.

jps <- style_load('http://www.zotero.org/styles/american-journal-of-political-science')

A list is returned, which you can index to various parts of the style specification.

jps$info
#> $title
#> [1] "American Journal of Political Science"
#> 
#> $title_short
#> [1] "AJPS"
#> 
#> $id
#> [1] "http://www.zotero.org/styles/american-journal-of-political-science"
#> 
#> $author
...
jps$title
#> [1] "American Journal of Political Science"
jps$citation_format
#> [1] "author-date"
jps$links_template
#> [1] "http://www.zotero.org/styles/american-political-science-association"
jps$editor
#> $editor
#> $editor$variable
#> [1] "editor"
#> 
#> $editor$delimiter
#> [1] ", "
#> 
#> 
#> $label
#> $label$form
...
jps$author
#> $author
#> $author$variable
#> [1] "author"
#> 
#> 
#> $label
#> $label$form
#> [1] "short"
#> 
#> $label$prefix
...

Get raw XML

You can also get raw XML if you'd rather deal with that format.

style_xml('http://www.zotero.org/styles/american-journal-of-political-science')
#> <?xml version="1.0" encoding="utf-8"?>
#> <style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0" demote-non-dropping-particle="sort-only" default-locale="en-US">
#>   <info>
#>     <title>American Journal of Political Science</title>
#>     <title-short>AJPS</title-short>
#>     <id>http://www.zotero.org/styles/american-journal-of-political-science</id>
#>     <link href="http://www.zotero.org/styles/american-journal-of-political-science" rel="self"/>
#>     <link href="http://www.zotero.org/styles/american-political-science-association" rel="template"/>
#>     <link href="http://www.ajps.org/AJPS%20Style%20Guide.pdf" rel="documentation"/>
#>     <author>
...

Get styles

There is a GitHub repository of CSL styles at https://github.com/citation-style-language/styles-distribution. These don't come with the csl package, so you have to run get_styles() to get them on your machine. The default path is Sys.getenv("HOME")/styles, which for me is /Users/sacmac/styles. You can change where files are saved by using the path parameter.

get_styles()
#> 
#> Done! Files put in /Users/sacmac/styles
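If you'd rather put them somewhere else, pass a different path (the directory here is just an example):

get_styles(path = "~/csl-styles")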

After getting styles locally you can load them just as we did with style_load(), but from your machine. However, since the file is local, we can make this easier by letting you pass just the name of the style, like so:

style_load("apa")
#> $info
#> $info$title
#> [1] "American Psychological Association 6th edition"
#> 
#> $info$title_short
#> [1] "APA"
#> 
#> $info$id
#> [1] "http://www.zotero.org/styles/apa"
#> 
...

If you are unsure whether a style exists, check with style_exists()

style_exists("helloworld")
#> [1] FALSE
style_exists("acs-nano")
#> [1] TRUE

In addition, you can list the path for a single style, for several, or for all styles with styles()

styles("apa")
#> [1] "/Users/sacmac/styles/apa.csl"

All of them, truncated for blog brevity

styles()
#> $independent
#>    [1] "academy-of-management-review"                                                         
#>    [2] "acm-sig-proceedings-long-author-list"                                                 
#>    [3] "acm-sig-proceedings"                                                                  
#>    [4] "acm-sigchi-proceedings-extended-abstract-format"                                      
#>    [5] "acm-sigchi-proceedings"                                                               
#>    [6] "acm-siggraph"                                                                         
#>    [7] "acs-nano"                                                                             
#>    [8] "acta-anaesthesiologica-scandinavica"                                                  
#>    [9] "acta-anaesthesiologica-taiwanica"                                                     
...

Get locales

In addition to styles, there is a GitHub repo for locales at https://github.com/citation-style-language/locales. These also don't come with the csl package, so you have to run get_locales() to get them on your machine. The same path behavior described above for styles applies here.

get_locales()
#> 
#> Done! Files put in /Users/sacmac/locales