Recology

R/etc.

 

atomize - make new packages from other packages

We (rOpenSci) just held our 3rd annual rOpenSci unconference (http://unconf16.ropensci.org/) in San Francisco. There were a lot of ideas, and lots of awesome projects from awesome people came out of the 2-day event.

One weird idea I had comes from looking at the Node world, where there are lots of tiny packages instead of the often larger packages we have in the R world. One reason packages are tiny in Node is that a library running in the browser should be small for faster load times (esp. on mobile).

So the idea is: what if we could split all the functions in a package, or any particular function of your choice, into new packages, along with all the internal functions and dependencies they need? And do it automatically, not manually.
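To make the "with all the internal functions" part concrete, here is a minimal sketch of collecting everything a chosen function needs by walking a call map. It is written in Ruby purely for illustration and is not how atomize is implemented; the `functions_needed` helper and the call map contents are made up.

```ruby
require 'set'

# Given a call map (function name => names of internal functions it calls),
# return every function that must be copied so that `targets` still work.
# Hypothetical helper for illustration; not part of atomize or functionMap.
def functions_needed(call_map, targets)
  needed = Set.new
  queue = Array(targets).dup
  until queue.empty?
    fn = queue.shift
    next if needed.include?(fn) # already visited, also guards against cycles

    needed << fn
    queue.concat(call_map.fetch(fn, []))
  end
  needed.to_a.sort
end

# Made-up call map for illustration only; these are not real package internals.
example_map = {
  'exported_fn' => %w[helper_get helper_parse],
  'helper_get' => %w[helper_key],
  'helper_parse' => [],
  'helper_key' => [],
  'unrelated_fn' => %w[helper_parse]
}
p functions_needed(example_map, 'exported_fn')
# prints ["exported_fn", "helper_get", "helper_key", "helper_parse"]
```

Note that `unrelated_fn` is left out: only functions reachable from the one you asked for end up in the new package.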

So what are the use cases? I can’t imagine this being used to create stable packages to disperse to the world on CRAN, but it could be really useful for development purposes, or for R users/analysts that want lighter weight dependencies (e.g., a package with just the one function needed from a larger package).

This approach of course has drawbacks. The newly created package is now disconnected from the original. However, because it’s automated, you can just re-create it.

Another pain point would surely be with packages that have C/C++ code in them.

The package: atomize.

The package was made possible by the awesome functionMap package by Gábor Csárdi, and the more well-known devtools.

Expect bugs: the package has no tests. Sorry :(

Installation

devtools::install_github("ropenscilabs/atomize")
library("atomize")

usage

atomize a function into a separate package

You can select one function, or many. Here, I’m using another package I maintain, rredlist, a pkg to interact with the IUCN Redlist API.

In this example, I want a new package called foobar with just the function rl_citation(). The function atomize::atomizer() takes the path for the package to extract from, then a path for the new package, then the function names.

atomizer(path_ref = "../rredlist", path_new = "../foobar", funcs = "rl_citation")

This creates a new package in the path_new directory

install

Now we need to install the new package, which is easy with devtools::install()

devtools::install("../foobar")

load

Then load the new package

library("foobar")

call function

Now call the function in the new package

foobar::rl_citation()
#> [1] "IUCN 2015. IUCN Red List of Threatened Species. Version 2015-4 <www.iucnredlist.org>"

The result is identical to that of the same function in the original package

identical(rredlist::rl_citation(), foobar::rl_citation())
#> [1] TRUE

GenBank IDs API - get, match, swap id types

GenBank has two types of identifiers for its entries: accession numbers and GI identifiers (see this page for why there are two types of identifiers). Actually, recent news from NCBI is that GI identifiers will be phased out by September this year, which affects what I’ll talk about below.

There are a lot of sequences in GenBank. Sometimes you have identifiers and you want to check if they exist in GenBank, or want to get one type from another (accession from GI, or vice versa, although the GI phase-out will make this use case unnecessary), or just get a bunch of identifiers, perhaps for software testing.

Currently, the ENTREZ web services aren’t super performant or easy to use for the above use cases. Thus, I’ve built out a RESTful API for these use cases, called gbids.

gbids on GitHub: sckott/gbids

Here’s the tech stack:

  • API: Ruby/Sinatra
  • Storage: MySQL
  • Caching: Redis
    • each key cached for 3 hours
  • Server: Caddy
    • https
  • Authentication: none

I will soon have a cron job that updates the data when a new dump becomes available every Sunday, but for now we’re about a month behind the current dump of identifiers. If usage of the API picks up, I’ll know it’s worth maintaining and will make sure the data is up to date.

The base url is https://gbids.xyz.

Here’s a rundown of the API routes:

  • / re-routes to /heartbeat
  • /heartbeat - list routes
  • /acc - GET - list accession ids
  • /acc/:id,:id,... - GET - submit many accession numbers, get back boolean (match or no match)
  • /acc - POST
  • /gi - GET - list gi numbers
  • /gi/:id,:id,... - GET - submit many gi numbers, get back boolean (match or no match)
  • /gi - POST
  • /acc2gi/:id,:id,... - GET - get gi numbers from accession numbers
  • /acc2gi - POST
  • /gi2acc/:id,:id,... - GET - get accession numbers from gi numbers
  • /gi2acc - POST

Of course after GI identifiers are phased out, all gi routes will be removed.

The API docs are at recology.info/gbidsdocs - let me know if you have any feedback on those.

I also have a status page up at recology.info/gbidsstatus - it’s not automated, I have to update the status manually, but I do update that page whenever I’m doing maintenance and the API will be down, or if it goes down due to any other reason.

examples

Request to list accession identifiers, limit to 5

curl 'https://gbids.xyz/acc?limit=5' | jq .
{
  "matched": 692006925,
  "returned": 5,
  "data": [
    "A00002",
    "A00003",
    "X17276",
    "X60065",
    "CAA42669"
  ],
  "error": null
}

Request to match accession identifiers, some exist, while some do not

curl 'https://gbids.xyz/acc/AACY024124486,AACY024124483,asdfd,asdf,AACY024124476' | jq .
{
  "matched": 3,
  "returned": 5,
  "data": {
    "AACY024124486": true,
    "AACY024124483": true,
    "asdfd": false,
    "asdf": false,
    "AACY024124476": true
  },
  "error": null
}
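If you would rather call the match route from code than from curl, here is a minimal Ruby sketch using only the standard library. The helper names (`acc_match_url`, `fetch_acc_matches`, `partition_matches`) are made up for this example, and it assumes the response always has the shape shown above.

```ruby
require 'json'
require 'net/http'
require 'uri'

GBIDS_BASE = 'https://gbids.xyz'

# Build the URL for the /acc/:id,:id,... match route.
def acc_match_url(ids)
  "#{GBIDS_BASE}/acc/#{ids.join(',')}"
end

# Perform the GET request and return the raw JSON body (needs network access).
def fetch_acc_matches(ids)
  Net::HTTP.get(URI(acc_match_url(ids)))
end

# Split a match response into identifiers that exist and those that do not.
# Assumes "data" is an object mapping each identifier to true or false.
def partition_matches(json_body)
  parsed = JSON.parse(json_body)
  raise "API error: #{parsed['error']}" unless parsed['error'].nil?

  found, missing = parsed['data'].partition { |_id, exists| exists }
  { found: found.map(&:first), missing: missing.map(&:first) }
end
```

With the response shown above, `partition_matches` would put the three AACY identifiers under `:found` and `asdfd` and `asdf` under `:missing`.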

to do

I think it’d probably be worth adding routes for getting accession from taxonomy id and vice versa since that’s something that is currently not very easy. We could use the dumped accession2taxid data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/

feedback!

If you think this could be potentially useful, do try it out and send any feedback. I watch logs from the API, so I’m hoping for an increase in usage so I know people find it useful.

Since servers aren’t free, costs add up, and if I don’t see usage pick up I’ll discontinue the service at some point. So do use it!

And if anyone can offer free servers, I’d gladly take advantage of that. I’ve applied for an AWS research grant, but won’t hear back for a few months.

heythere - a robot to automate GitHub issue comments

GitHub issues are great for humans to correspond over software, or any other project. At rOpenSci we use an issue-based software review system (ropensci/onboarding). Software authors and reviewers go back and forth on the software, making a better product in the end.

We have a relatively small number of pieces of software under review at any one time compared to e.g., scientific journals - however, even with the small number, we as organizers, and authors and reviewers can forget things. For example:

  • an organizer can forget to remind a reviewer to get a review in
  • a reviewer can forget about a review, and may benefit from a friendly reminder
  • an author may forget about updating software based on the review

As we manage more package submissions through our system, automation by a machine, or robot, will be increasingly helpful to keep the system moving smoothly.

A big red flag with automated systems is the annoyance factor. We can try to be smart about this and only remind people when it’s really needed.

I’ve been working on a thing for a while now called heythere. It’s a Ruby application that is currently set up to run on Heroku, though you could run it anywhere you want. Right now it runs once per day to check whether it should send any reminders to organizers, authors, or reviewers.

heythere on GitHub: ropenscilabs/heythere

How it works

heythere is controlled through a series of environment variables that control GitHub authentication, the first day post reviewer assignment when a reminder should be sent, how many days after reviews are submitted to ask if the author needs any help, and more. Check out the repo for all the env var options.

The Ruby app can be run via a rake task from the command line, or triggered with a scheduler on something like Heroku.

When the app runs, we look for environment variables that you set. If we don’t find them we use sensible defaults.

Using the env vars, we grab the issues for the repository you chose, limit them to a subset of your choosing based on a series of labels, then compare the dates on comments against your env vars, and either skip or send off comments on issues.
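The "read env vars, fall back to defaults, compare dates, then skip or send" flow can be sketched roughly like this. This is an illustration, not the actual heythere code: the helper names, the default values, and the exact meaning given to each variable here are assumptions.

```ruby
require 'date'

# Read an integer setting from the environment, falling back to a default.
def int_setting(name, default, env = ENV)
  value = env[name]
  (value.nil? || value.strip.empty?) ? default : Integer(value)
end

# Decide what, if anything, to post today for one review.
# Assumed semantics (not taken from the heythere source):
#   PRE_DEADLINE_DAYS        - days after reviewer assignment for the first reminder
#   DEADLINE_DAYS            - days after assignment when the review is due
#   POST_DEADLINE_EVERY_DAYS - once overdue, remind every N days
def reminder_action(assigned_on, today, env = ENV)
  pre = int_setting('HEYTHERE_PRE_DEADLINE_DAYS', 15, env)           # default is a guess
  deadline = int_setting('HEYTHERE_DEADLINE_DAYS', 21, env)          # default is a guess
  every = int_setting('HEYTHERE_POST_DEADLINE_EVERY_DAYS', 4, env)   # default is a guess
  elapsed = (today - assigned_on).to_i

  return :pre_deadline_reminder if elapsed == pre
  return :deadline_reminder if elapsed == deadline
  return :overdue_reminder if elapsed > deadline && ((elapsed - deadline) % every).zero?

  :skip
end
```

Returning `:skip` on most days is what keeps the annoyance factor down: a comment only goes out on the specific days the settings call for.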

We use octokit under the hood to work with GitHub issues.

How to use it

clone

git clone git@github.com:ropenscilabs/heythere.git
cd heythere

setup

Change the repo in Rakefile to whatever your repository is.

Heythere.hey_there(repo = 'ropensci/onboarding')

Create the app (use a different name, of course)

heroku apps:create ropensci-hey-there

Add the repository that you are targeting:

heroku config:add HEYTHERE_REPOSITORY=<github-repository>  # e.g. owner/repo

Create a GitHub personal access token just for this application. You’ll need to set an env var for your username and one for the token. We read these in the app.

heroku config:add GITHUB_USERNAME=<github-user>
heroku config:add GITHUB_PAT_OCTOKIT=<github-pat-for-octokit>

Optionally, set env vars for various options. You don’t have to set these - we’ll use defaults

heroku config:add HEYTHERE_PRE_DEADLINE_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_DEADLINE_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_POST_DEADLINE_EVERY_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_POST_REVIEW_IN_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_POST_REVIEW_TOGGLE=<boolean>
heroku config:add HEYTHERE_BOT_NICKNAME=<string>

Also save all these env vars in your .bash_profile, .zshrc, or similar so you can run the app locally. E.g. with entries like export HEYTHERE_PRE_DEADLINE_DAYS=15

You can see all your Heroku config vars using heroku config or use rake envs

Push your app to Heroku

git push heroku master

Add the scheduler to your heroku app

heroku addons:create scheduler:standard
heroku addons:open scheduler

Add the task rake hey to your heroku scheduler and set to whatever schedule you want.

usage

If you have your repo in an env var as above, run the rake task hey

rake hey

If not, then pass the repo to hey like

rake hey repo=owner/repo
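Rake copies `NAME=value` arguments such as `repo=owner/repo` into ENV, so choosing the target repository can be sketched as below. The `target_repo` helper, and the choice to let the command-line value win over HEYTHERE_REPOSITORY, are assumptions for illustration rather than what the heythere Rakefile necessarily does.

```ruby
# Pick the repository to work on: a `repo=...` argument passed to rake wins,
# otherwise fall back to the HEYTHERE_REPOSITORY env var.
# Hypothetical helper; the precedence order is an assumption.
def target_repo(env = ENV)
  repo = env['repo'] || env['HEYTHERE_REPOSITORY']
  raise ArgumentError, 'no repository given' if repo.nil? || repo.strip.empty?
  raise ArgumentError, "expected owner/repo, got #{repo}" unless repo.match?(%r{\A[^/\s]+/[^/\s]+\z})

  repo
end
```

Failing early on a missing or malformed repository name avoids making GitHub API calls that are bound to fail.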

what does it look like?

This is what a comment looks like in a thread

[Image: screenshot of a heythere comment in an issue thread]

You could set up a different GitHub account as your robot so it’s clearly not coming from a real person.

feedback

I’ll continue to improve heythere as we get feedback on its use in ropensci/onboarding. Would also love any feedback from you, internet. Thanks!