elastic - Elasticsearch from R

We’ve (ropensci) been working on an R client for interacting with Elasticsearch for a while now, first commit was November 2013.

Elasticsearch is a document database built on the JVM. elastic interacts with the Elasticsearch HTTP API, and includes functions for setting connection details to Elasticsearch instances, loading bulk data, searching for documents with both HTTP query variables and JSON based body requests. In addition, elastic provides functions for interacting with APIs for indices, documents, nodes, clusters, an interface to the cat API, and more.

Here’s a few examples of what you can do.

Note: elastic was just pushed to CRAN. It just got accepted, so binaries may not be available, try again soon, or install from Github, or install from source from CRAN like install.packages("https://cran.r-project.org/src/contrib/elastic_0.3.0.tar.gz", repos=NULL, type="source").

Installation

install.packages("elastic")

Or install development version:

install.packages("devtools")
devtools::install_github("ropensci/elastic")

Then load package

library("elastic")

Install Elasticsearch

Elasticsearch installation help

Unix (linux/osx)

Replace 1.4.1 with the version you are working with.

Download zip or tar file from Elasticsearch see here for download, e.g., curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.tar.gz
Uncompress it: tar -xvf elasticsearch-1.4.1.tar.gz
Move it: sudo mv /path/to/elasticsearch-1.4.1 /usr/local
Navigate to /usr/local: cd /usr/local
Add shortcut: sudo ln -s elasticsearch-1.4.1 elasticsearch

On OSX, you can install via Homebrew: brew install elasticsearch

Windows

Windows users can follow the above, but unzip the zip file instead of uncompressing the tar file.

Start Elasticsearch

Navigate to elasticsearch: cd /usr/local/elasticsearch
Start elasticsearch: bin/elasticsearch

I create a little bash shortcut called es that does both of the above commands in one step (cd /usr/local/elasticsearch && bin/elasticsearch).

Note: Windows users should run the elasticsearch.bat file

Initialize connection

The function connect() is used before doing anything else to set the connection details to your remote or local elasticsearch store. The details created by connect() are written to your options for the current session, and are used by elastic functions.

connect()

On package load, your base url and port are set to https://127.0.0.1 and 9200, respectively. You can of course override these settings per session or for all sessions.

Get data

Elasticsearch has a bulk load API to load data in fast. The format is pretty weird though. It’s sort of JSON, but would pass no JSON linter. I include a few data sets in elastic so it’s easy to get up and running, and so when you run examples in this package they’ll actually run the same way (hopefully).

Shakespeare data

Elasticsearch provides some data on Shakespeare plays. I’ve provided a subset of this data in this package. Get the path for the file specific to your machine:

shakespeare <- system.file("examples", "shakespeare_data.json", package = "elastic")

Then load the data into Elasticsearch:

docs_bulk(shakespeare)

Public Library of Science (PLOS) data

A dataset inluded in the elastic package is metadata for PLOS scholarly articles.

plosdat <- system.file("examples", "plos_data.json", package = "elastic")
docs_bulk(plosdat)

Global Biodiversity Information Facility (GBIF) data

A dataset inluded in the elastic package is data for GBIF species occurrence records. Get the file path, then load:

gbifdat <- system.file("examples", "gbif_data.json", package = "elastic")
docs_bulk(gbifdat)

GBIF geo data with a coordinates element to allow geo_shape queries

gbifgeo <- system.file("examples", "gbif_geo.json", package = "elastic")
docs_bulk(gbifgeo)

The Search function

The main interface to searching documents in your Elasticsearch store is the function Search(). I nearly always develop R software using all lowercase, but R has a function called search(), and I wanted to avoid collision with that function.

Search() is an interface to both the HTTP search API (in which queries are passed in the URI of the request, meaning queries have to be relatively simple), as well as the POST API, or the Query DSL, in which queries are passed in the body of the request (so can be much more complex).

There are a huge amount of ways you can search Elasticsearch documents - this tutorial covers some of them, and highlights the ways in which you interact with the R outputs.

Search an index

out <- Search(index="shakespeare")
out$hits$total
#> [1] 5000

out$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Search an index by type

Search(index="shakespeare", type="act")$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "act"
#> 
#> $`_id`
#> [1] "2227"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 2228
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 81
#> 
#> $`_source`$line_number
#> [1] ""
#> 
#> $`_source`$speaker
#> [1] "FALSTAFF"
#> 
#> $`_source`$text_entry
#> [1] "ACT IV"

Return certain fields

Search(index="shakespeare", fields=c('play_name','speaker'))$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $fields
#> $fields$speaker
#> $fields$speaker[[1]]
#> [1] "KING HENRY IV"
#> 
#> 
#> $fields$play_name
#> $fields$play_name[[1]]
#> [1] "Henry IV"

Sorting

Search(index="shakespeare", type="act", sort="text_entry")$hits$hits[1:2]
#> [[1]]
#> [[1]]$`_index`
#> [1] "shakespeare"
#> 
#> [[1]]$`_type`
#> [1] "act"
#> 
#> [[1]]$`_id`
#> [1] "2227"
#> 
#> [[1]]$`_version`
#> [1] 1
#> 
#> [[1]]$`_score`
#> NULL
#> 
#> [[1]]$`_source`
#> [[1]]$`_source`$line_id
#> [1] 2228
#> 
#> [[1]]$`_source`$play_name
#> [1] "Henry IV"
#> 
#> [[1]]$`_source`$speech_number
#> [1] 81
#> 
#> [[1]]$`_source`$line_number
#> [1] ""
#> 
#> [[1]]$`_source`$speaker
#> [1] "FALSTAFF"
#> 
#> [[1]]$`_source`$text_entry
#> [1] "ACT IV"
#> 
#> 
#> [[1]]$sort
#> [[1]]$sort[[1]]
#> [1] "act"
#> 
#> 
#> 
#> [[2]]
#> [[2]]$`_index`
#> [1] "shakespeare"
#> 
#> [[2]]$`_type`
#> [1] "act"
#> 
#> [[2]]$`_id`
#> [1] "2633"
#> 
#> [[2]]$`_version`
#> [1] 1
#> 
#> [[2]]$`_score`
#> NULL
#> 
#> [[2]]$`_source`
#> [[2]]$`_source`$line_id
#> [1] 2634
#> 
#> [[2]]$`_source`$play_name
#> [1] "Henry IV"
#> 
#> [[2]]$`_source`$speech_number
#> [1] 9
#> 
#> [[2]]$`_source`$line_number
#> [1] ""
#> 
#> [[2]]$`_source`$speaker
#> [1] "ARCHBISHOP OF YORK"
#> 
#> [[2]]$`_source`$text_entry
#> [1] "ACT V"
#> 
#> 
#> [[2]]$sort
#> [[2]]$sort[[1]]
#> [1] "act"

Paging

Search(index="shakespeare", size=1, from=1, fields='text_entry')$hits
#> $total
#> [1] 5000
#> 
#> $max_score
#> [1] 1
#> 
#> $hits
#> $hits[[1]]
#> $hits[[1]]$`_index`
#> [1] "shakespeare"
#> 
#> $hits[[1]]$`_type`
#> [1] "line"
#> 
#> $hits[[1]]$`_id`
#> [1] "9"
#> 
#> $hits[[1]]$`_version`
#> [1] 1
#> 
#> $hits[[1]]$`_score`
#> [1] 1
#> 
#> $hits[[1]]$fields
#> $hits[[1]]$fields$text_entry
#> $hits[[1]]$fields$text_entry[[1]]
#> [1] "Nor more shall trenching war channel her fields,"

Queries

Using the q parameter you can pass in a query, which gets passed in the URI of the query. This type of query is less powerful than the below query passed in the body of the request, using the body parameter.

Search(index="shakespeare", type="act", q="speaker:KING HENRY IV")$hits$total
#> [1] 9

Query DSL searches - queries sent in the body of the request

Pass in as an R list

aggs <- list(aggs = list(stats = list(terms = list(field = "text_entry"))))
Search(index="shakespeare", body=aggs)$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Or pass in as json query with newlines, easy to read

aggs <- '{
    "aggs": {
        "stats" : {
            "terms" : {
                "field" : "text_entry"
            }
        }
    }
}'
Search(index="shakespeare", body=aggs)$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Or pass in collapsed json string

aggs <- '{"aggs":{"stats":{"terms":{"field":"text_entry"}}}}'
Search(index="shakespeare", body=aggs)$hits$hits[[1]]
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "4"
#> 
#> $`_version`
#> [1] 1
#> 
#> $`_score`
#> [1] 1
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 5
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.2"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Find we a time for frighted peace to pant,"

Fuzzy query

Fuzzy query on numerics

fuzzy <- list(query = list(fuzzy = list(speech_number = 7)))
Search(index="shakespeare", body=fuzzy)$hits$total
#> [1] 523

fuzzy <- list(query = list(fuzzy = list(speech_number = list(value = 7, fuzziness = 4))))
Search(index="shakespeare", body=fuzzy)$hits$total
#> [1] 1499

Range query

With numeric

body <- list(query=list(range=list(decimalLongitude=list(gte=1, lte=3))))
Search('gbif', body=body)$hits$total
#> [1] 24

body <- list(query=list(range=list(decimalLongitude=list(gte=2.9, lte=10))))
Search('gbif', body=body)$hits$total
#> [1] 166

With dates

body <- list(query=list(range=list(eventDate=list(gte="2012-01-01", lte="now"))))
Search('gbif', body=body)$hits$total
#> [1] 899

body <- list(query=list(range=list(eventDate=list(gte="2014-01-01", lte="now"))))
Search('gbif', body=body)$hits$total
#> [1] 685

Highlighting

body <- '{
 "query": {
   "query_string": {
     "query" : "cell"
   }
 },
 "highlight": {
   "fields": {
     "title": {"number_of_fragments": 2}
   }
 }
}'
out <- Search('plos', 'article', body=body)
out$hits$total
#> [1] 57

sapply(out$hits$hits, function(x) x$highlight$title[[1]])[8:10]
#> [1] "c-FLIP Protects Eosinophils from TNF-α-Mediated <em>Cell</em> Death In Vivo"                          
#> [2] "DUSP1 Is a Novel Target for Enhancing Pancreatic Cancer <em>Cell</em> Sensitivity to Gemcitabine"     
#> [3] "Carbon Ion Radiation Inhibits Glioma and Endothelial <em>Cell</em> Migration Induced by Secreted VEGF"

Scrolling search - instead of paging

Search('shakespeare', q="a*")$hits$total
#> [1] 2747
res <- Search(index = 'shakespeare', q="a*", scroll="1m")
res <- Search(index = 'shakespeare', q="a*", scroll="1m", search_type = "scan")
length(scroll(scroll_id = res$`_scroll_id`)$hits$hits)
#> [1] 50

res <- Search(index = 'shakespeare', q="a*", scroll="5m", search_type = "scan")
out <- list()
hits <- 1
while(hits != 0){
  res <- scroll(scroll_id = res$`_scroll_id`)
  hits <- length(res$hits$hits)
  if(hits > 0)
    out <- c(out, res$hits$hits)
}
length(out)
#> [1] 2747

Woohoo! Collected all 2747 documents in very little time.

The cat API

List cat methods

cat_()
#> =^.^=
#> /_cat/allocation
#> /_cat/shards
#> /_cat/shards/{index}
#> /_cat/master
#> /_cat/nodes
#> /_cat/indices
#> /_cat/indices/{index}
#> /_cat/segments
#> /_cat/segments/{index}
#> /_cat/count
#> /_cat/count/{index}
#> /_cat/recovery
#> /_cat/recovery/{index}
#> /_cat/health
#> /_cat/pending_tasks
#> /_cat/aliases
#> /_cat/aliases/{alias}
#> /_cat/thread_pool
#> /_cat/plugins
#> /_cat/fielddata
#> /_cat/fielddata/{fields}

Get aliases

cat_aliases()
#> things plos - - - 
#> stuff  plos - - -

Get indices

cat_indices()
#> yellow open plosmore     5 1  1000  0   3.5mb   3.5mb 
#> yellow open leotheadfadf 5 1     0  0    575b    575b 
#> red    open alsothat     3 2                          
#> yellow open gbif         5 1   899  0     1mb     1mb 
#> yellow open gbifgeopoint 5 1     0  0    575b    575b 
#> yellow open gbifnewgeo   5 1     2  0   5.8kb   5.8kb 
#> yellow open plos         5 1  1202 39  14.2mb  14.2mb 
#> yellow open leothedog    5 1     0  0    575b    575b 
#> yellow open shakespeare  5 1  5000  0     1mb     1mb 
#> yellow open gbifgeo      5 1   600  0 861.9kb 861.9kb 
#> yellow open plosbigdata  5 1 20000  0  53.6mb  53.6mb 
#> yellow open mapuris      5 1    31  0  34.4kb  34.4kb 
#> yellow open leothelion   5 1     0  0    575b    575b

Get nodes

cat_nodes()
#> Scotts-MacBook-Pro.local 192.168.1.104 6 79 3.44 d * Hellfire

Work with indices

out <- index_get(index='shakespeare')
names(out$shakespeare$mappings)
#> [1] "line"  "scene" "act"

Check for index existence

index_exists(index='shakespeare')
#> [1] TRUE

Delete an index

index_delete(index='plos')
#> $acknowledged
#> [1] TRUE

Create an index

index_create(index='twitter')
#> $acknowledged
#> [1] TRUE

Work with documents

Get a document

docs_get(index='shakespeare', type='line', id=10)
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "10"
#> 
#> $`_version`
#> [1] 1
#> 
#> $found
#> [1] TRUE
#> 
#> $`_source`
#> $`_source`$line_id
#> [1] 11
#> 
#> $`_source`$play_name
#> [1] "Henry IV"
#> 
#> $`_source`$speech_number
#> [1] 1
#> 
#> $`_source`$line_number
#> [1] "1.1.8"
#> 
#> $`_source`$speaker
#> [1] "KING HENRY IV"
#> 
#> $`_source`$text_entry
#> [1] "Nor bruise her flowerets with the armed hoofs"

Get certain fields

docs_get(index='shakespeare', type='line', id=10, fields=c('play_name','speaker'))
#> $`_index`
#> [1] "shakespeare"
#> 
#> $`_type`
#> [1] "line"
#> 
#> $`_id`
#> [1] "10"
#> 
#> $`_version`
#> [1] 1
#> 
#> $found
#> [1] TRUE
#> 
#> $fields
#> $fields$play_name
#> $fields$play_name[[1]]
#> [1] "Henry IV"
#> 
#> 
#> $fields$speaker
#> $fields$speaker[[1]]
#> [1] "KING HENRY IV"

Test for existence of the document

docs_get(index='plos', type='article', id=1, exists=TRUE)
#> [1] FALSE

docs_get(index='plos', type='article', id=123456, exists=TRUE)
#> [1] FALSE

Thats it

Let us know if you have any feedback!

Installation#

Install Elasticsearch#

Start Elasticsearch#

Initialize connection#

Get data#

Shakespeare data#

Public Library of Science (PLOS) data#

Global Biodiversity Information Facility (GBIF) data#

The Search function#

Search an index#

Search an index by type#

Return certain fields#

Sorting#

Paging#

Queries#

Query DSL searches - queries sent in the body of the request#

Fuzzy query#

Range query#

Highlighting#

Scrolling search - instead of paging#

The cat API#

Work with indices#

Work with documents#

Thats it#