Recology

R/etc.

 

Elasticsearch backup and restore

setup backup

curl -XPUT 'http://localhost:9200/_snapshot/my_backup/' -d '{
    "type": "fs",
    "settings": {
        "location": "/Users/sacmac/esbackups/my_backup",
        "compress": true
    }
}'

create backup

http PUT "localhost:9200/_snapshot/my_backup/snapshot_2?wait_for_completion=true"
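Snapshot names have to be unique within a repository, so a fixed name like snapshot_2 only works once. A minimal sketch of generating a timestamped name instead, assuming the my_backup repository registered above and curl rather than httpie; the snapshot_url helper and the naming scheme are hypothetical, not something Elasticsearch provides.

```shell
# Hypothetical helper: build the snapshot URL for a repository and snapshot name.
snapshot_url() {
  echo "localhost:9200/_snapshot/$1/$2?wait_for_completion=true"
}

# Assumed naming scheme: a timestamp keeps each snapshot name unique (and lowercase).
name="snapshot_$(date +%Y%m%d_%H%M%S)"

# Print the command instead of running it; drop the echo to actually create the snapshot.
echo curl -XPUT "$(snapshot_url my_backup "$name")"
```

Printing the command first makes it easy to eyeball the URL before anything is sent to the cluster.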

get info on snapshot

http "localhost:9200/_snapshot/my_backup/snapshot_2"

restore

curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_2/_restore"

partial restore, including various options that can be used

curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_2/_restore" -d '{
    "indices": "index_1,index_2",
    "ignore_unavailable": "true",
    "include_global_state": false,
    "rename_pattern": "index_(.+)",
    "rename_replacement": "restored_index_$1"
}'

note to self, secure elasticsearch

Recently I spun up a box on a cloud hosting provider, planning to make tens of thousands of queries to an Elasticsearch instance on the same box. I could have done this on my own machine, but didn't want to take up compute resources.

I installed R and Elasticsearch on the box, then went about doing my thang.

A day later when things were still running, the hosting provider sent me a message that apparently my box had been serving up a DDoS attack.

This was incredibly surprising, as I don't even know how to do such a thing.

After some digging, it seems the culprit was likely Elasticsearch: a number of tutorials and blog posts state that Elasticsearch is insecure by default, so if it's exposed on a public port, someone can hack in. I had only used Elasticsearch locally on my own machine, so I hadn't thought about security. Here are a few resources for security help:

Trying to narrow down the various pieces of advice for securing Elasticsearch, here's a list:

  • Use iptables (or rather nftables?) to firewall the box
  • Whitelist certain trusted IPs
  • Use the elasticsearch-http-basic plugin, which adds basic username/password login
  • Remove public access: set network.bind_host: localhost and script.disable_dynamic: true in the elasticsearch.yml config file
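The last item in the list can be sketched as a small shell function. This is only an illustration: the function name harden_es_config is made up, and the config path in the example is an assumption, since the location of elasticsearch.yml depends on how Elasticsearch was installed.

```shell
# Append the two settings from the list above to an elasticsearch.yml file.
# The path is passed in because its location varies by install (an assumption here).
harden_es_config() {
  cat >> "$1" <<'EOF'
network.bind_host: localhost
script.disable_dynamic: true
EOF
}

# Example (hypothetical path), followed by a restart of Elasticsearch:
# harden_es_config /etc/elasticsearch/elasticsearch.yml
```

After a restart, a request to port 9200 from another machine should no longer connect, while curl localhost:9200 on the box itself should still work.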

Elasticsearch has a new security feature of its own, Shield, but I believe it's only available to enterprise customers. Boo.

Package development

Someone asked recently about tips for a package development workflow that improves the odds of a successful submission to CRAN.

The ultimate guide is probably Hadley's book on package development, but here's more of a bulleted list of some things I do.

Use RStudio

Choice of text editor/IDE is always contentious, but for R package development RStudio makes things easy, with keyboard shortcuts for many of the steps you repeat during development. See the cheatsheet.

Documentation and roxygen2

You can always write your manual (.Rd) files by hand, but to avoid mistakes such as missing or duplicate parameter definitions, simply write documentation alongside your code with roxygen2. The RStudio IDE includes a keyboard shortcut (shift+cmd+D on Mac) to generate manual files from your roxygen documentation.

When you run R CMD check in your terminal, call devtools::check(), or use the keyboard shortcuts in RStudio, you may notice problems with documentation. You can then make fixes, quickly re-document the whole package, and run check again. Iterating on this process is very easy with RStudio keyboard shortcuts.
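For the terminal route, here is a rough sketch of building and checking a package in one step. The function names pkg_tarball and check_pkg and the package folder name are made up for illustration; R CMD build, R CMD check, and the --as-cran flag are the standard commands. It assumes R is on your PATH and that the package folder sits in the current directory.

```shell
# Work out the tarball name that R CMD build produces (<Package>_<Version>.tar.gz)
# from the DESCRIPTION file in a package directory.
pkg_tarball() {
  pkg=$(sed -n 's/^Package:[[:space:]]*//p' "$1/DESCRIPTION")
  ver=$(sed -n 's/^Version:[[:space:]]*//p' "$1/DESCRIPTION")
  echo "${pkg}_${ver}.tar.gz"
}

# Build the tarball, then check it with the CRAN-style settings.
check_pkg() {
  R CMD build "$1" && R CMD check --as-cran "$(pkg_tarball "$1")"
}

# Example, run from the directory that contains the package folder (hypothetical name):
# check_pkg mypkg
```

Checking the built tarball rather than the source directory is closer to what gets submitted.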

Examples

CRAN checks now actually run code examples wrapped in \donttest. So if you have examples that may throw warnings or errors, on purpose or by accident, make sure to wrap them in \dontrun instead. Ripley used to complain that at least some examples in the package should run on check, but I haven't heard this complaint in a while.

First submission of the package?

If it is your first submission of the package:

  • Include this sentence in your submission: "I have read and agree to the CRAN policies at http://cran.r-project.org/web/packages/policies.html"

Code

In my experience, CRAN maintainers generally don't look at code, but they may if some examples or tests fail on submission.

Tests

If you have tests in your package, and you should, think about whether your tests are likely to fail in some scenarios. For example, I mostly write packages that work with web APIs, none of which are under my control, meaning they could fail at any time, causing tests to fail on CRAN (CRAN runs check once per day, I think).

If you do have tests that may fail, think about skipping tests altogether on CRAN. If you do this, it's especially important to use continuous integration on your own. For example, use Travis-CI, which will run check on your package on each change. If you have a package that works with web APIs, it's important to check your package functionality even when you aren't changing it, since the resource your package works with can change. So run tests e.g. once per day - you can do something like we do using a bit of Ruby code.

Continuous integration

I just talked about this above, but a few more thoughts. This is relatively easy to do, it's free, and I think it pays off greatly once set up. You can also do more than test code and run checks: you can deploy artifacts to various places. For example, at the end of a build on Travis-CI you could push a binary of the package to Dropbox or Amazon S3. A few good options that I've used:

There are other options, but I haven't used them...

DESCRIPTION file

In addition to following CRAN's guidelines (search for "description" in the CRAN policies), here are some tips for parts of the file.

  • Title: must be title case, no period at end
  • Description: Don't use the phrase "This package"
  • Watch out for "possibly mis-spelled words" warnings on check. CRAN may reject your package for very minor mis-spellings.
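Two of these tips are easy to check mechanically before submitting. The following is a crude sketch, not an official check: the function name lint_description is made up, it only looks for a trailing period on the Title and for the phrase "This package" anywhere in the file, and it will miss the phrase if it is split across lines.

```shell
# Crude pre-submission check of a DESCRIPTION file for two of the tips above.
# Prints a message per problem and returns non-zero if any were found.
lint_description() {
  lint_status=0
  lint_title=$(sed -n 's/^Title:[[:space:]]*//p' "$1")
  case "$lint_title" in
    *.) echo "Title ends with a period: $lint_title"; lint_status=1 ;;
  esac
  if grep -qi 'this package' "$1"; then
    echo "DESCRIPTION uses the phrase 'This package'"
    lint_status=1
  fi
  return "$lint_status"
}

# Example, run from the package root:
# lint_description DESCRIPTION
```

It doesn't replace R CMD check, which is still what reports the spelling notes.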

Use a cran-comments.md file

Hadley supports this in devtools. Put a file named cran-comments.md in the root of your package. In this file, include the comments you would submit for the package (e.g., I agree to cran policies...this package passed all checks...etc). Remember to list cran-comments.md in the .Rbuildignore file in the root of your package so that R CMD check doesn't complain. When you use the devtools::release() function, it will look for this file and automatically include its contents as your submission comments. Doing this and using release() means you don't have to worry about Brian Ripley complaining about rich text emails.
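The .Rbuildignore step is easy to forget, so here is a small sketch that adds the entry only if it's missing. The function name ignore_cran_comments is made up; the entry itself is written as a regular expression because that is how .Rbuildignore lines are interpreted.

```shell
# Add cran-comments.md to .Rbuildignore unless it's already listed.
# Entries in .Rbuildignore are regular expressions, hence the anchors and escaped dot.
ignore_cran_comments() {
  entry='^cran-comments\.md$'
  grep -qxF "$entry" "$1/.Rbuildignore" 2>/dev/null ||
    printf '%s\n' "$entry" >> "$1/.Rbuildignore"
}

# Example, run from the package root:
# ignore_cran_comments .
```

Running it twice is harmless, since the entry is only appended when no identical line exists.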

CRAN policy changes

If you're on Twitter, watch the #rstats hashtag to be more likely to notice any upcoming changes in package submission policies. You can also follow Dirk's CRAN policy watch repo.

Other things