The problem

The R community has a package distribution thing called CRAN just like Ruby has Rubygems, and Python has Pypi, etc. On all packages on CRAN, the CRAN maintainers run checks on each package on multiple versions of R and on many operating systems. They report those results on a page associated with the package, like this one.

You might be thinking: okay, but we have Travis-CI and friends, so who cares about that? Well, it’s these checks that CRAN runs that will determine if your package on CRAN leads to emails to you asking for changes, and possibly the package being taken down if e.g., they email and you don’t respond for a period of time.

So CRAN provides these package checks. Now what? Ideally, these would be available through an API so that the data is machine readable, which then makes many other things possible (see What’s Next below).

So how to build the API?

Building the CRAN checks API

On GitHub: https://github.com/ropensci/cchecksapi

My main goal learning goals with this API tech wise were two fold:

  • learn how to dockerize the application
  • learn how to use MongoDB

I hadn’t Dockerized a web API myself before, so that was an important goal - and I had actually never used MongoDB, but wanted to give it a shot to get familiar with it.

The whole stack is:

  • language: Ruby
  • web API framework: Sinatra
  • http Ruby gem: faraday
  • database: mongodb
  • server: caddy
  • container: all wrapped up in docker (docker-compose)
  • hosting: Amazon EC2
  • scheduling: crontab

At a high level, the system is as so:

  • Once a day a few Ruby scripts (for packages, for maintainers):
    • collects the names of packages on CRAN from Gábor Csárdi’s https://crandb.r-pkg.org API and maintainer emails from CRAN itself, then
    • goes out to the CRAN website and collects check results for each package, then
    • insert data into a MongoDB database
  • The API provides routes for getting data on specific packages by name, or all packages, and data on all packages for any given maintainers email adddress, or all maintainers
    • API calls make a query into the MongoDB database matching on the package name or maintainer email address
    • data is given back as JSON

The API doesn’t currently use caching, but may add if it seems needed.

Ruby and Sinatra

I really like Ruby. It’s a language that is fun to use, the community is great, and there’s tons of packages. Ruby is great for making web stuff, including web APIs. When doing web stuff, for me that means web APIs. For web APIs in Ruby, Rails is too heavy for all the stuff I do - that’s where Sinatra comes in.

Sinatra is a lightweight framework for making web apps/APIs. I make all my web APIs with Sinatra, and have had few complaints. Some may say “you should use X or Y because faster”, or whatever, but Sinatra is plenty fast for my use cases. Not every use case is “we’re Facebook”, or “we’re Google”.

Until recently I’ve been very much manually managing my Sinatra web APIs on servers - that is, installing/updating everything on the server itself, without using containers or any configuration management. This blog post is the blog post I would have wanted to read when I was figuring out how to dockerize my web APIs.

The API

The main meat of the API is definitions of routes. In addition, I’ve included a number of rules about what HTTP verbs are allowed to be used, what headers to send in each response, how to respond to client and server failures, etc.

This is what one of the route definitions looks like:

get '/pkgs/?' do
  headers_get
  begin
    lim = (params[:limit] || 10).to_i
    off = (params[:offset] || 0).to_i
    raise Exception.new('limit too large (max 1000)') unless lim <= 1000
    d = $cks.find({}, {"limit" => lim, "skip" => off})
    dat = d.to_a
    raise Exception.new('no results found') if d.nil?
    { found: d.count, count: dat.length, offset: nil, error: nil,
      data: dat }.to_json
  rescue Exception => e
    halt 400, { count: 0, error: { message: e.message }, data: nil }.to_json
  end
end

This code chunk is for the /pkgs route on the API (check it out at https://cranchecks.info/pkgs). The headers_get bit sends a pre-defined set of headers in the response. The begin ... rescue ... end bit is a “try catch” thing - leading to a JSON failure response in case there is a failure - and a JSON response on success.

Collecting data and MongoDB

As stated above, data is updated once a day. The code for scraping data on the package level and maintainer level is pretty similar. For both, the steps are the following: a) collect all names (for /pkgs that’s package names from https://crandb.r-pkg.org; for /maintainers that’s maintainer email addresses from https://cran.rstudio.com/web/checks/check_summary_by_maintainer.html), b) for each package name or maintainer email scrape CRAN check results, c) with all data collected drop data in MongoDB and then load all new data (maybe this could be an update step?). You can see the gory details on GitHub for packages and maintainers.

Those steps above in code for packages is like the following:

def scrape_all
  pkgs = cran_packages; # get all pkg names
  out = [] # make an array
  pkgs.each do |x| # for each pkg, scrape check results
    out << scrape_pkg(x)
  end
  if $cks.count > 0
    $cks.drop # drop data in Mongo
    $cks = $mongo[:checks] # recreate database in Mongo
  end
  $cks.insert_many(out.map { |e| prep_mongo(e) }) # load new data into Mongo
end

$cks is the MongoDB database connection.

Docker

For containerizing the API, I used Docker. A colleague had used Docker Compose, and it was a really easy experience spinning up and taking down the application we were working on. So I wanted to learn how to do that myself. After trial and error, finally got to a solution for this API. Here is my docker-compose.yml file:

mongo:
  image: mongo
  volumes:
    - $HOME/data/mongodb:/data/db # persists data to disk outside container
  restart: always
  ports:
    - "27017:27017"

api:
  build: .
  ports:
    - "8834:8834"
  links:
    - mongo

This specifies the container for MongoDB and for the API, and specifies in the API container to link to the mongo container.

To build and run do

docker-compose build && docker-compose up -d

The -d flag is for daemonize, i.e., run in the background. To kill them run

docker-compose stop && docker-compose rm

Caddy server

Caddy is great server. I never really used Nginx, so I can’t compare the two really - I just know that Caddy is super easy. To install, check out the installation page https://caddyserver.com/download, and it’s easy as something like curl https://getcaddy.com | bash -s personal (depends on configuration options on that page and license choice).

I know there’s an option to run a separate container with Caddy, but I run Caddy outside containers.

My Caddyfile has something similar to the following:

cranchecks.info {
  gzip
  tls email@foobar.com

  log / logfile.log "{remote} - [{when}] {method} {uri} {query} {proto} {status} {size} {>User-Agent}" {
     rotate_size 3
  }

  proxy / localhost:8834 {
    transparent
  }
}

What’s Next

There’s still more work to do.

  • Better /maintainers results
    • right now, we have two arrays of results, one from the html table on the CRAN results page and the other from the text below it. This duplication isn’t ideal.
    • it would be helpful to have a summary across all packages for any given maintainer
  • Better /pkgs results
    • it would be helpful to have a summary across all R versions and platforms for any given package
  • Include actual CRAN check results - CRAN check results can include output of the failures (whether they’re in examples or the test suite, or an installation error). The API doesn’t currently include that output, but thinking about how it could.
  • Possibly update data more often. Right now we update once per day. Seems like results do roll in at different times though, perhaps as builds are done for each pkg?
  • Notification service: package maintainers can opt-in to notifications when their checks are failing so they can be on top of fixes quickly. This could be managed through the API itself, with no GUI, but to make it palatable to all types may want to make a super simple web page to sign up.

Check out the issue tracker to follow progress or file a feature request or bug.


Thanks

Thanks to Gábor Csárdi for the idea to make a /maintainers route.

Further reading

In a similar post Cloud66 folks talked about deploying an API with the same stack essentially: Sinatra, MongoDB, and Docker.

p.s.

I mostly write about R software, so some readers may use R: if you want to make a web API but only know R, try learning Ruby! It can’t hurt to learn Ruby, and you’ll be happy you did.