07 Apr 2016
We (rOpenSci) just held our 3rd annual rOpenSci unconference (http://unconf16.ropensci.org/) in San Francisco. There were a lot of ideas, and lots of awesome projects from awesome people came out of the 2 day event.
One weird idea I had comes from looking at the Node world, where there are lots of tiny packages, instead of the often larger packages we have in the R world. One reason for tiny in Node is that of course you want a library to be tiny if running in the browser for faster load times (esp. on mobile).
So the idea is, what if we could separate all the functions in a package, or any particular function of your choice, into new packages, with all the internal functions and dependencies. And automatically as well, not manually.
So what are the use cases? I can’t imagine this being used to create stable packages to disperse to the world on CRAN, but it could be really useful for development purposes, or for R users/analysts that want lighter weight dependencies (e.g., a package with just the one function needed from a larger package).
This approach of course has drawbacks. The new created package is now broken apart from the original - however, beause it’s automated, you can just re-create it.
Another pain point would surely be with packages that have C/C++ code in them.
The package was made possible by the awesome functionMap package by Gábor Csárdi, and the more well-known
Expect bugs, the package has no tests. Sorry :(
atomize a fxn into separate package
You can select one function, or many. Here, I’m using another package I maintain, rredlist, a pkg to interact with the IUCN Redlist API.
In this example, I want a new package called
foobar with just the function
rl_citation(). The function
atomize::atomizer() takes the path for the package to extract from, then a path for the new package, then the function names.
atomizer(path_ref = "../rredlist", path_new = "../foobar", funcs = "rl_citation")
This creates a new package in the
Now we need to install the new package, can do easily with
Then load the new package
Now call the function in the new package
#>  "IUCN 2015. IUCN Red List of Threatened Species. Version 2015-4 <www.iucnredlist.org>"
it’s identical to the same function in the old package
#>  TRUE
29 Mar 2016
GenBank IDs, accession numbers and GI identifiers, are the two types of identifiers for entries in GenBank. (see this page for why there are two types of identifiers). Actually, recent news from NCBI is that GI identifiers will be phased out by September this year, which affects what I’ll talk about below.
There are a lot of sequences in GenBank. Sometimes you have identifiers and you want to check if they exist in GenBank, or want to get one type from another (accession from GI, or vice versa; although GI phase out will make this use case no longer needed), or just get a bunch of identifiers for software testing purposes perhaps.
Currently, the ENTREZ web services aren’t super performant or easy to use for the above use cases. Thus, I’ve built out a RESTful API for these use cases, called gbids.
gbids on GitHub: sckott/gbids
Here’s the tech stack:
- API: Ruby/Sinatra
- Storage: MySQL
- Caching: Redis
- each key cached for 3 hours
- Server: Caddy
- Authentication: none
Will soon have a cron job update when new dump is available every Sunday, but for now we’re about a month behind the current dump of identifiers. If usage of the API picks up, I’ll know it’s worth maintaining and make sure data is up to date.
The base url is https://gbids.xyz.
Here’s a rundown of the API routes:
/ re-routes to
/heartbeat - list routes
GET - list accession ids
GET - submit many accession numbers, get back boolean (match or no match)
GET - list gi numbers
GET - submit many gi numbers, get back boolean (match or no match)
GET - get gi numbers from accession numbers
GET - get accession numbers from gi numbers
Of course after GI identifiers are phased out, all
gi routes will be removed.
The API docs are at recology.info/gbidsdocs - let me know if you have any feedback on those.
I also have a status page up at recology.info/gbidsstatus - it’s not automated, I have to update the status manually, but I do update that page whenever I’m doing maintenance and the API will be down, or if it goes down due to any other reason.
Request to list accession identifiers, limit to 5
curl 'https://gbids.xyz/acc?limit=5' | jq .
Request to match accession identifiers, some exist, while some do not
curl 'https://gbids.xyz/acc/AACY024124486,AACY024124483,asdfd,asdf,AACY024124476' | jq .
I think it’d probably be worth adding routes for getting accession from taxonomy id and vice versa since that’s something that is currently not very easy. We could use the dumped accession2taxid data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
If you think this could be potentially useful, do try it out and send any feedback. I watch logs from the API, so I’m hoping for an increase in usage so I know people find it useful.
Since servers aren’t free, costs add up, and if I don’t see usage pick up I’ll discontinue the service at some point. So do use it!
And if anyone can offer free servers, I’d gladly take advantage of that. I’ve applied for AWS research grant, but won’t hear back for a few months.
24 Mar 2016
GitHub issues are great for humans to correspond over software, or any other project. At rOpenSci we use an issue based software review system (ropensci/onboarding). Software authors and reviewers go back and forth on the software, making a better product in the end.
We have a relatively small number of pieces of software under review at any one time compared to e.g., scientific journals - however, even with the small number, we as organizers, and authors and reviewers can forget things. For example:
- an organizer can forget to remind a reviewer to get a review in
- a reviewer can forget about a review, and may benefit from a friendly reminder
- an author may forget about updating software based on the review
As we are managing more package submissions through our system, automated things done by machine, or robot, will be increasingly helpful to keep the system moving smoothly.
A big red flag with automated systems is the annoyance factor. We can try to be smart about this and only remind people when it’s really needed.
I’ve been working on a thing for a while now, it’s called
heythere. It’s a Ruby application that is currently set up to run on Heroku, though you could run it anywhere you want. It’s running right now once per day to check to see if it should send any reminders to organizers, authors, reviewers.
heythere on GitHub: ropenscilabs/heythere
How it works
heythere is controlled through a series of environment variables that controls GitHub authentication, the first day post reviewer assignment when a reminder should be sent, how many days after reviews are submitted to ask if the author needs any help, and more. Check out the repo for all the env var options.
The Ruby app can be run via a rake task from the command line, or triggered with a scheduler on something like Heroku.
When the app runs, we look for environment variables that you set. If we don’t find them we use sensible defaults.
Using the env vars, we grab the issues for the repository you chose, limit to a subset of your choosing based on a series of labels, then compare dates on comments compared to your env vars, and either skip or send of comments on issues.
We use ockokit under the hood to work with GitHub issues.
How to use it
git clone email@example.com:ropenscilabs/heythere.git
Change the repo in
Rakefile to whatever your repository is.
Heythere.hey_there(repo = 'ropensci/onboarding')
Create the app (use a different name, of course)
heroku apps:create ropensci-hey-there
Add the repository that you are targeting:
heroku config:add HEYTHERE_REPOSITORY=<github-repository> (like `owner/repo`)
Create a GitHub personal access token just for this application. You’ll need to set a env var for your username and the token. We read these in the app.
heroku config:add GITHUB_USERNAME=<github-user>
heroku config:add GITHUB_PAT_OCTOKIT=<github-pat-for-octokit>
Optionally, set env vars for various options. You don’t have to set these - we’ll use defaults
heroku config:add HEYTHERE_PRE_DEADLINE_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_DEADLINE_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_POST_DEADLINE_EVERY_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_POST_REVIEW_IN_DAYS=<number-of-days-integer>
heroku config:add HEYTHERE_POST_REVIEW_TOGGLE=<boolean>
heroku config:add HEYTHERE_BOT_NICKNAME=<string>
Also save all these env vars in your
.zshrc, or similar so you can run the app locally. E.g. with entries like
You can see all your Heroku config vars using
heroku config or use
Push your app to Heroku
Add the scheduler to your heroku app
heroku addons:create scheduler:standard
heroku addons:open scheduler
Add the task
rake hey to your heroku scheduler and set to whatever schedule you want.
If you have your repo in an env var as above, run the rake task
If not, then pass the repo to
what does it look like?
This is what a comment looks like in a thread
You could set up a different GitHub account as your robot so it’s clearly not coming from a real person.
I’ll continue to improve
heythere as we get feedback on its use in
ropensci/onboarding. Would also love any feedback from you, internet. Thanks!