With the help of user input, I’ve tweaked solr a bit to make things faster using default settings. I imagine the main interface for most people using the solr R client is solr_search(), which used to have wt=json by default. Changing this to wt=csv gives better performance. And it makes sense to use csv anyway: the point of using an R client is probably to get data into a data.frame eventually, so requesting csv (already in tabular format) is a natural fit, especially if it’s faster too.
Install
Install and load solr, along with microbenchmark for the timing comparisons below
devtools::install_github("ropensci/solr")
library("solr")
library("microbenchmark")
Setup
Define base url and fields to return
url <- 'https://api.plos.org/search'
fields <- c('id','cross_published_journal_name','cross_published_journal_key',
'cross_published_journal_eissn','pmid','pmcid','publisher','journal',
'publication_date','article_type','article_type_facet','author',
'author_facet','volume','issue','elocation_id','author_display',
'competing_interest','copyright')
json
The previous default for solr_search() used wt=json
solr_search(q='*:*', rows=10, fl=fields, base=url, wt = "json")
#> Source: local data frame [10 x 19]
#>
#> id
#> 1 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4
#> 2 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/title
#> 3 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/abstract
#> 4 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/references
#> 5 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/body
#> 6 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525
#> 7 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/title
#> 8 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/abstract
#> 9 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/references
#> 10 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/body
#> Variables not shown: cross_published_journal_name (chr),
#> cross_published_journal_key (chr), cross_published_journal_eissn (chr),
#> pmid (chr), pmcid (chr), publisher (chr), journal (chr),
#> publication_date (chr), article_type (chr), article_type_facet (chr),
#> author (chr), author_facet (chr), volume (int), issue (int),
#> elocation_id (chr), author_display (chr), competing_interest (chr),
#> copyright (chr)
csv
The default wt setting is now csv, so we can drop the wt parameter entirely
solr_search(q='*:*', rows=10, fl=fields, base=url)
#> Source: local data frame [10 x 19]
#>
#> id
#> 1 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4
#> 2 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/title
#> 3 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/abstract
#> 4 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/references
#> 5 10.1371/annotation/856f0890-9d85-4719-8e54-c27530ac94f4/body
#> 6 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525
#> 7 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/title
#> 8 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/abstract
#> 9 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/references
#> 10 10.1371/annotation/8551e3d5-fdd5-413b-a253-170ba13b7525/body
#> Variables not shown: cross_published_journal_name (chr),
#> cross_published_journal_key (chr), cross_published_journal_eissn (chr),
#> pmid (chr), pmcid (chr), publisher (chr), journal (chr),
#> publication_date (chr), article_type (chr), article_type_facet (chr),
#> author (chr), author_facet (chr), volume (int), issue (int),
#> elocation_id (chr), author_display (chr), competing_interest (chr),
#> copyright (chr)
Compare times
When parsing to a data.frame (which solr_search() does by default), csv is quite a bit faster.
microbenchmark(
json = solr_search(q='*:*', rows=500, fl=fields, base=url, wt = "json", verbose = FALSE),
csv = solr_search(q='*:*', rows=500, fl=fields, base=url, wt = "csv", verbose = FALSE),
times = 20
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> json 965.7043 1013.014 1124.1229 1086.3225 1227.9054 1441.8332 20 b
#> csv 509.6573 520.089 541.5784 532.4546 548.0303 723.7575 20 a
json vs xml vs csv
When getting raw data, csv is fastest, json comes next, and xml pulls up the rear.
microbenchmark(
json = solr_search(q='*:*', rows=1000, fl=fields, base=url, wt = "json", verbose = FALSE, raw = TRUE),
csv = solr_search(q='*:*', rows=1000, fl=fields, base=url, wt = "csv", verbose = FALSE, raw = TRUE),
xml = solr_search(q='*:*', rows=1000, fl=fields, base=url, wt = "xml", verbose = FALSE, raw = TRUE),
times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> json 1110.9515 1142.478 1198.9981 1169.0808 1195.5709 1518.7412 10 b
#> csv 801.6871 802.516 826.0655 819.1532 835.0512 873.4266 10 a
#> xml 1507.1111 1554.002 1618.5963 1617.5208 1671.0026 1740.4448 10 c
Notes
Note that wt=csv is only available in solr_search() and solr_all(), because the csv writer only returns the docs element, dropping other elements such as facets, mlt, groups, and stats.
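So, for example, a facet request still needs json or xml, since facet counts would be dropped from a csv response. A quick sketch (assuming solr_facet() takes the same q/base/wt arguments as solr_search()):
# facet counts aren't in the csv response, so stick with json here
solr_facet(q = '*:*', facet.field = 'journal', base = url, wt = "json")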
Also, note that the http client used in solr is httr, which passes a gzip compression header by default, so as long as the server serving up the Solr data has compression turned on, you get compressed responses without doing anything.
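You can verify this yourself with httr; a quick sketch (the content-encoding header is standard HTTP, nothing solr-specific):
library("httr")
res <- GET(url, query = list(q = '*:*', rows = 1, wt = 'csv'))
# if the server has compression on, this should say "gzip"
res$headers$`content-encoding`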
Another speedup: when you do use wt=json and then parse to a data.frame, the package now uses dplyr, which sped things up considerably.
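The gist of that change looks roughly like this (a minimal sketch of the idea, not the package’s exact internals; it assumes scalar field values):
library("jsonlite")
library("dplyr")
# fetch raw json, pull out the docs, and bind them row-wise with dplyr
raw <- solr_search(q = '*:*', rows = 10, fl = fields, base = url, wt = "json", raw = TRUE)
docs <- fromJSON(as.character(raw), simplifyVector = FALSE)$response$docs
bind_rows(lapply(docs, as.data.frame, stringsAsFactors = FALSE))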