Recology

R/etc.

Conditionality meta-analysis data

The paper

One paper from my graduate work asked most generally ~ "How much does the variation in magnitudes and signs of species interaction outcomes vary?". More specifically, we wanted to know if variation differed among species interaction classes (mutualism, competition, predation), and among various "gradients" (space, time, etc.). To answer this question, we used a meta-analysis approach (rather than e.g., a field experiment). We published the paper recently.

p.s. I really really wish we would have put it in an open access journal...

The data

Anyway, I'm here to talk about the data. We didn't get the data up with the paper, but it is up on Figshare now. The files there are the following:

  • coniditionality.R - script used to process the data from variables_prelim.csv
  • variables_prelim.csv - description of variables in the preliminary data set, matches conditionality_data_prelim.csv
  • variables_used.csv - description of variables in the used data set, matches conditionality_data_used.csv
  • conditionality_data_prelim.csv - preliminary data, the raw data
  • conditionality_data_used.csv - the data used for our paper
  • README.md - the readme
  • paper_selection.csv - the list of papers we went through, with remarks about paper selection

Please do play with the data, publish some papers, etc, etc. It took 6 of us about 4 years to collect this data; we skimmed through ~11,000 papers on the first pass (aka. skimming through abstracts in Google Scholar and Web of Science), then decided on nearly 500 papers to get data from, and narrowed down to 247 papers for the publication mentioned above. Now, there was no funding for this, so it was sort of done in between other projects, but still, it was simply A LOT of tables to digitize, and graphs to extract data points from. Anyway, hopefully you will find this data useful :p

EML

I think this dataset would be a great introduction to the potential power of EML (Ecological Metadata Langauge). At rOpenSci, one of our team Carl Boettiger, along with Claas-Thido Pfaff, Duncan Temple Lang, Karthik Ram, and Matt Jones, have created an R client for EML, to parse EML files and to create and publish them.

What is EML?/Why EML?

A demonstration is in order...

Example using EML with this dataset

Install EML

library("devtools")
install.packages("RHTMLForms", repos = "http://www.omegahat.org/R/", type="source")
install_github("ropensci/EML", build=FALSE, dependencies=c("DEPENDS", "IMPORTS"))

Load EML

library('EML')

Prepare metadata

# dataset
prelim_dat <- read.csv("conditionality_data_prelim.csv")
# variable descriptions for each column
prelim_vars <- read.csv("variables_prelim.csv", stringsAsFactors = FALSE)

Get column definitions in a vector

col_defs <- prelim_vars$description

Create unit definitions for each column

unit_defs <- list(
  c(unit = "number",
    bounds = c(0, Inf)),
  c(unit = "number",
    bounds = c(0, Inf)),
  "independent replicates",
  c(unit = "number",
    bounds = c(0, Inf)),

  ... <CUTOFF>
)

Write an EML file

eml_write(prelim_dat,
          unit.defs = unit_defs,
          col.defs = col_defs,
          creator = "Scott Chamberlain",
          contact = "myrmecocystus@gmail.com",
          file = "conditionality_data_prelim_eml.xml")
## [1] "conditionality_data_prelim_eml.xml"

Validate the EML file

eml_validate("conditionality_data_prelim_eml.xml")
## EML specific tests XML specific tests 
##               TRUE               TRUE

Read data and metadata

gg <- eml_read("conditionality_data_prelim_eml.xml")
eml_get(gg, "contact")
## [1] "myrmecocystus@gmail.com"
eml_get(gg, "citation_info")
## Chamberlain S (2014-10-06). _metadata_.
dat <- eml_get(gg, "data.frame")
head(dat[,c(1:10)])
##   order i indrep avg author_last  finit_1 finit_2 finit_abv co_author
## 1     1 1      a   1      Devall margaret       s        ms     Thein
## 2     2 1      a   2      Devall margaret       s        ms     Thein
## 3     3 1      a   3      Devall margaret       s        ms     Thein
## 4     4 1      a   4      Devall margaret       s        ms     Thein
## 5     5 1      a   5      Devall margaret       s        ms     Thein
## 6     6 1      a   6      Devall margaret       s        ms     Thein
##   sinit_1
## 1 leonard
## 2 leonard
## 3 leonard
## 4 leonard
## 5 leonard
## 6 leonard

Publish

We can also use the EML package to publish the data, here to Figshare.

First, install rfigshare

install.packages("rfigshare")
library('rfigshare')

Then publish using eml_publish()

figid <- eml_publish(
            file = "conditionality_data_prelim_eml.xml",
            description = "EML file for Chamberlain, S.A., J.A. Rudgers, and J.L. Bronstein. 2014. How context-dependent are species interactions. Ecology Letters",
            categories = "Ecology",
            tags = "EML",
            destination = "figshare",
            visibility = "public",
            title = "condionality data, EML")
fs_make_public(figid)

rsunlight - R client for Sunlight Labs APIs

My last blog post on this package was so long ago the package wrapped both New York Times APIs and Sunlight Labs APIs and the package was called govdat. I split that package up into rsunlight for Sunlight Labs APIs and rtimes for some New York Times APIs. rtimes is in development at Github.

We've updated the package to include four sets of functions, one set for each of four Sunlight Labs APIs (with a separate prefix for each API):

  • Congress API (cg_)
  • Open States API (os_)
  • Capitol Words API (cw_)
  • Influence Explorer API (ie_)

Then there are many methods for each API.

rsunlight intro

Installation

First, installation

devtools::install_github("ropengov/rsunlight")

Load the library

library("rsunlight")

Congress API

Search for Fed level bills that include the term health care in them.

res <- cg_bills(query='health care')
head(res$results[,1:4])
##          nicknames congress last_version_on sponsor_id
## 1        obamacare      111      2010-08-25    S000749
## 2 obamacare, ppaca      111      2010-08-25    R000053
## 3             NULL      113      2013-10-09    K000220
## 4             NULL      111      2009-01-06    I000056
## 5             NULL      112      2011-01-05    I000056
## 6             NULL      111      2009-05-05    D000197

Search for bills that have the two terms transparency and accountability within 5 words of each other in the bill.

res <- cg_bills(query='transparency accountability'~5)
head(res$results[,1:4])
##   congress last_version_on sponsor_id
## 1      111      2009-01-15    R000435
## 2      113      2013-07-17    R000595
## 3      112      2011-12-08    R000435
## 4      113      2013-09-19    R000435
## 5      112      2011-11-10    R000595
## 6      113      2013-07-23    C000560
##                                       urls.govtrack
## 1   http://www.govtrack.us/congress/bills/111/hr557
## 2  https://www.govtrack.us/congress/bills/113/s1313
## 3  http://www.govtrack.us/congress/bills/112/hr2829
## 4 https://www.govtrack.us/congress/bills/113/hr3155
## 5   http://www.govtrack.us/congress/bills/112/s1848
## 6  https://www.govtrack.us/congress/bills/113/s1347
##                                 urls.opencongress
## 1  http://www.opencongress.org/bill/111-h557/show
## 2      http://www.opencongress.org/bill/s1313-113
## 3 http://www.opencongress.org/bill/112-h2829/show
## 4     http://www.opencongress.org/bill/hr3155-113
## 5 http://www.opencongress.org/bill/112-s1848/show
## 6      http://www.opencongress.org/bill/s1347-113
##                                          urls.congress
## 1   http://beta.congress.gov/bill/111th/house-bill/557
## 2 http://beta.congress.gov/bill/113th/senate-bill/1313
## 3  http://beta.congress.gov/bill/112th/house-bill/2829
## 4  http://beta.congress.gov/bill/113th/house-bill/3155
## 5 http://beta.congress.gov/bill/112th/senate-bill/1848
## 6 http://beta.congress.gov/bill/113th/senate-bill/1347

Open States API

Search State Bills, in this case search for the term agriculture in Texas.

res <- os_billsearch(terms = 'agriculture', state = 'tx')
head(res)
##                                                                                                                                                 title
## 1 Relating to authorizing the issuance of revenue bonds to fund capital projects at public institutions of higher education; making an appropriation.
## 2                          Relating to authorizing the issuance of revenue bonds to fund capital projects at public institutions of higher education.
## 3                          Relating to authorizing the issuance of revenue bonds to fund capital projects at public institutions of higher education.
## 4                          Relating to authorizing the issuance of revenue bonds to fund capital projects at public institutions of higher education.
## 5 Relating to authorizing the issuance of revenue bonds to fund capital projects at public institutions of higher education; making an appropriation.
## 6                                Relating to access to certain facilities by search and rescue dogs and their handlers; providing a criminal penalty.
##            created_at          updated_at          id chamber state
## 1 2013-08-01 03:33:40 2013-08-07 03:10:10 TXB00034894   upper    tx
## 2 2013-08-01 03:33:38 2013-08-02 03:20:14 TXB00034893   upper    tx
## 3 2013-07-21 03:03:53 2013-07-28 03:28:30 TXB00034814   upper    tx
## 4 2013-07-03 02:44:03 2013-07-14 03:00:31 TXB00034514   upper    tx
## 5 2013-06-16 03:48:13 2013-06-23 04:02:49 TXB00033988   upper    tx
## 6 2013-03-03 04:47:26 2013-07-01 21:25:36 TXB00027556   upper    tx
##   session type
## 1     833 bill
## 2     833 bill
## 3     832 bill
## 4     832 bill
## 5     831 bill
## 6      83 bill
##                                                                             subjects
## 1                                   Commerce, Education, Budget, Spending, and Taxes
## 2                                   Commerce, Education, Budget, Spending, and Taxes
## 3                                   Commerce, Education, Budget, Spending, and Taxes
## 4                                   Commerce, Education, Budget, Spending, and Taxes
## 5                                   Commerce, Education, Budget, Spending, and Taxes
## 6 Commerce, Business and Consumers, Animal Rights and Wildlife Issues, Health, Crime
##   bill_id
## 1    SB 3
## 2   SB 10
## 3   SB 40
## 4    SB 6
## 5   SB 44
## 6 SB 1010

Search for legislators in California (ca) and in the democratic party

res <- os_legislatorsearch(state = 'ca', party = 'democratic', fields = c('full_name','+capitol_office.phone'))
head(res)
##            phone        id       full_name
## 1 (916) 319-2014 CAL000058   Nancy Skinner
## 2 (916) 319-2015 CAL000059   Joan Buchanan
## 3 (916) 319-2022 CAL000084       Paul Fong
## 4 (916) 319-2046 CAL000089      John Pérez
## 5 (916) 319-2080 CAL000098 V. Manuel Pérez
## 6 (916) 319-2001 CAL000101  Wesley Chesbro

Now you can call each representative, yay!

Capitol Words API

Search for phrase climate change used by politicians between September 5th and 16th, 2011:

head(cw_text(phrase='climate change', start_date='2011-09-05', end_date='2011-09-16', party='D')[,c('speaker_last','origin_url')])
##   speaker_last
## 1      Tsongas
## 2       Inslee
## 3        Costa
## 4        Boxer
## 5       Durbin
## 6        Boxer
##                                                                                   origin_url
## 1 http://origin.www.gpo.gov/fdsys/pkg/CREC-2011-09-14/html/CREC-2011-09-14-pt1-PgH6149-5.htm
## 2   http://origin.www.gpo.gov/fdsys/pkg/CREC-2011-09-15/html/CREC-2011-09-15-pt1-PgH6186.htm
## 3 http://origin.www.gpo.gov/fdsys/pkg/CREC-2011-09-13/html/CREC-2011-09-13-pt1-PgE1609-2.htm
## 4   http://origin.www.gpo.gov/fdsys/pkg/CREC-2011-09-15/html/CREC-2011-09-15-pt1-PgS5650.htm
## 5   http://origin.www.gpo.gov/fdsys/pkg/CREC-2011-09-13/html/CREC-2011-09-13-pt1-PgS5510.htm
## 6 http://origin.www.gpo.gov/fdsys/pkg/CREC-2011-09-13/html/CREC-2011-09-13-pt1-PgS5513-2.htm

Plot mentions of the term climate change over time for Democrats vs. Republicans

library('ggplot2')
dat_d <- cw_timeseries(phrase='climate change', party="D")
dat_d$party <- rep("D", nrow(dat_d))
dat_r <- cw_timeseries(phrase='climate change', party="R")
dat_r$party <- rep("R", nrow(dat_r))
dat_both <- rbind(dat_d, dat_r)
ggplot(dat_both, aes(day, count, colour=party)) +
   geom_line() +
   theme_grey(base_size=20) +
   scale_colour_manual(values=c("blue","red"))

plot of chunk unnamed-chunk-9

Influence Explorer API

Search for contributions of equal to or more than $20,000,000.

ie_contr(amount='>|20000000')[,c('amount','recipient_name','contributor_name')]
##         amount
## 1  25177212.00
## 2  20000000.00
## 3  20000000.00
## 4  20000000.00
## 5  20000000.00
## 6  20000000.00
## 7  50000000.00
## 8  34000000.00
## 9  28000000.00
## 10 20000000.00
##                                                   recipient_name
## 1                                       Republican National Cmte
## 2  CALIFORNIANS TO CLOSE THE OUT-OF-STATE CORPORATE TAX LOOPHOLE
## 3                                                   WHITMAN, MEG
## 4                                                   WHITMAN, MEG
## 5                                                   WHITMAN, MEG
## 6                                                   WHITMAN, MEG
## 7                                         GOLISANO, B THOMAS (G)
## 8                                         GOLISANO, B THOMAS (G)
## 9                                         GOLISANO, B THOMAS (G)
## 10                                        GOLISANO, B THOMAS (G)
##           contributor_name
## 1           Romney Victory
## 2         STEYER, THOMAS F
## 3  WHITMAN, MARGARET (MEG)
## 4  WHITMAN, MARGARET (MEG)
## 5  WHITMAN, MARGARET (MEG)
## 6  WHITMAN, MARGARET (MEG)
## 7       GOLISANO, B THOMAS
## 8       GOLISANO, B THOMAS
## 9       GOLISANO, B THOMAS
## 10      GOLISANO, B THOMAS

Top industries, by contributions given. UNKOWN is a very influential industry. Of course law firms are high up there, as well as real estate. I'm sure oil and gas is embarrased that they're contributing less than pulic sector unions.

(res <- ie_industries(method='top_ind', limit=10))
##       count        amount                               id
## 1  14919818 3825359507.21 cdb3f500a3f74179bb4a5eb8b2932fa6
## 2   3600761 2787678962.95 f50cf984a2e3477c8167d32e2b14e052
## 3    329906 1717649914.58 9cac88377c3b400e89c2d6762e3f28f6
## 4   1386613 1707457092.04 7500030dffe24844aa467a75f7aedfd1
## 5    774496 1563637586.57 0af3f418f426497e8bbf916bfc074ebc
## 6    546367 1389220855.35 52e5d4c6c0fa47c3bdb199a28f96d434
## 7   2134350 1384221307.53 a05a0d06f6814b31bece35a81fcb40c7
## 8   1003850  986588892.83 8ada0fc2d6994f2ab06c7e025dff2284
## 9    567082  775241387.17 52766c4910a846f2813a1dda212b7027
## 10   151006  706747646.35 13718be68388456d9b6e8db753f06e72
##    should_show_entity                    name
## 1                TRUE                 UNKNOWN
## 2                TRUE       LAWYERS/LAW FIRMS
## 3                TRUE  CANDIDATE SELF-FINANCE
## 4                TRUE             REAL ESTATE
## 5                TRUE SECURITIES & INVESTMENT
## 6                TRUE    PUBLIC SECTOR UNIONS
## 7                TRUE    HEALTH PROFESSIONALS
## 8                TRUE               INSURANCE
## 9                TRUE               OIL & GAS
## 10               TRUE        CASINOS/GAMBLING
res$amount <- as.numeric(res$amount)
ggplot(res, aes(reorder(name, amount), amount)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_y_continuous(labels=dollar) +
  theme_grey(base_size = 14)

plot of chunk unnamed-chunk-11


Feedback

Please do use rsunlight, and let us know what you want fixed, new features, etc.

Still to come:

  • Functions to visualize data from each API. You can do this yourself, but a few functions will be created to help those that are new to R.
  • Vectorize functions so that you can give many inputs to a function instead of a single input.
  • test suite: embarrasingly, there is no test suite yet, boo me.
  • I plan to push rsunlight to CRAN soon as v0.3

analogsea - v0.1 notes

My last blog post introduced the R package I'm working on analogsea, an R client for the Digital Ocean API.

Things have changed a bit, including fillig out more functions for all API endpoints, and incorparting feedback from Hadley and Karthik. The package is as v0.1 now, so I thought I'd say a few things about how it works.

Note that Digital Ocean's v2 API is in beta stage now, so the current version of analogsea at v0.1 works with their v1 API. The v2 branch of analogsea is being developed for their v2 API.

If you sign up for an account with Digital Ocean use this referral link: https://www.digitalocean.com/?refcode=0740f5169634 so I can earn some credits. thx :)

First, installation

Note: I did try to submit to CRAN, but Ripley complained about the package name so I'd rather not waste my time esp since people using this likely will already know about install_github().

devtools::install_github("sckott/analogsea")

Load the library

library("analogsea")
## Loading required package: magrittr

Authenticate has changed a bit. Whereas auth details were stored as environment variables before, I'm just using R's options. do_auth() will ask for your Digital Ocean details. You can enter them each R session, or store them in your .Rprofile file. After successful authentication, each function simply looks for your auth details with getOption(). You don't have to use this function first, though if you don't your first call to another function will ask for auth details.

do_auth()

sizes, images, and keys functions have changed a bit, by default outputting a data.frame now.

List available regions

regions()
##   id            name slug
## 1  3 San Francisco 1 sfo1
## 2  4      New York 2 nyc2
## 3  5     Amsterdam 2 ams2
## 4  6     Singapore 1 sgp1

List available sizes

sizes()
##   id  name  slug memory cpu disk cost_per_hour cost_per_month
## 1 66 512MB 512mb    512   1   20       0.00744            5.0
## 2 63   1GB   1gb   1024   1   30       0.01488           10.0
## 3 62   2GB   2gb   2048   2   40       0.02976           20.0
## 4 64   4GB   4gb   4096   2   60       0.05952           40.0
## 5 65   8GB   8gb   8192   4   80       0.11905           80.0
## 6 61  16GB  16gb  16384   8  160       0.23810          160.0
## 7 60  32GB  32gb  32768  12  320       0.47619          320.0
## 8 70  48GB  48gb  49152  16  480       0.71429          480.0
## 9 69  64GB  64gb  65536  20  640       0.95238          640.0

List available images

head(images())
##        id                  name             slug distribution public sfo1
## 1 3209452 rstudioserverssh_snap             <NA>       Ubuntu  FALSE    1
## 2    1601        CentOS 5.8 x64   centos-5-8-x64       CentOS   TRUE    1
## 3    1602        CentOS 5.8 x32   centos-5-8-x32       CentOS   TRUE    1
## 4   12573        Debian 6.0 x64   debian-6-0-x64       Debian   TRUE    1
## 5   12575        Debian 6.0 x32   debian-6-0-x32       Debian   TRUE    1
## 6   14097      Ubuntu 10.04 x64 ubuntu-10-04-x64       Ubuntu   TRUE    1
##   nyc1 ams1 nyc2 ams2 sgp1
## 1   NA   NA   NA   NA   NA
## 2    1    1    1    1    1
## 3    1    1    1    1    1
## 4    1    1    1    1    1
## 5    1    1    1    1    1
## 6    1    1    1    1    1

List ssh keys

keys()
## $ssh_keys
## $ssh_keys[[1]]
## $ssh_keys[[1]]$id
## [1] 89103
##
## $ssh_keys[[1]]$name
## [1] "Scott Chamberlain"

One change that's of interest is that most of the various droplets_*() functions take in the outputs of other droplets_*() functions. This means that we can pipe outputs of one droplets_*() function to another, including non-droplet_* functions (see examples).

Let's create a droplet:

(res <- droplets_new(name="foo", size_slug = '512mb', image_slug = 'ubuntu-14-04-x64', region_slug = 'sfo1', ssh_key_ids = 89103))
$droplet
$droplet$id
[1] 1880805

$droplet$name
[1] "foo"

$droplet$image_id
[1] 3240036

$droplet$size_id
[1] 66

$droplet$event_id
[1] 26711810

List my droplets

This function used to be do_droplets_get()

droplets()
## $droplet_ids
## [1] 1880805
##
## $droplets
## $droplets[[1]]
## $droplets[[1]]$id
## [1] 1880805
##
## $droplets[[1]]$name
## [1] "foo"
##
## $droplets[[1]]$image_id
## [1] 3240036
##
## $droplets[[1]]$size_id
## [1] 66
##
## $droplets[[1]]$region_id
## [1] 3
##
## $droplets[[1]]$backups_active
## [1] FALSE
##
## $droplets[[1]]$ip_address
## [1] "162.243.152.56"
##
## $droplets[[1]]$private_ip_address
## NULL
##
## $droplets[[1]]$locked
## [1] FALSE
##
## $droplets[[1]]$status
## [1] "active"
##
## $droplets[[1]]$created_at
## [1] "2014-06-18T14:15:35Z"
##
##
##
## $event_id
## NULL

As mentioned above we can now pipe output of droplet*() functions to other droplet*() functions.

Here, pipe output of lising droplets droplets() to the events() function

droplets() %>% events()
## Error: No event id found

In this case there were no event ids to get event data on.

Here, we'll get details for the droplet we just created, then pipe that to droplets_power_off()

droplets(1880805) %>% droplets_power_off
## $droplet_ids
## [1] 1880805
##
## $droplets
## $droplets$droplet_ids
## [1] 1880805
##
## $droplets$droplets
## $droplets$droplets$id
## [1] 1880805
##
## $droplets$droplets$name
## [1] "foo"
##
## $droplets$droplets$image_id
## [1] 3240036
##
## $droplets$droplets$size_id
## [1] 66
##
## $droplets$droplets$region_id
## [1] 3
##
## $droplets$droplets$backups_active
## [1] FALSE
##
## $droplets$droplets$ip_address
## [1] "162.243.152.56"
##
## $droplets$droplets$private_ip_address
## NULL
##
## $droplets$droplets$locked
## [1] FALSE
##
## $droplets$droplets$status
## [1] "active"
##
## $droplets$droplets$created_at
## [1] "2014-06-18T14:15:35Z"
##
## $droplets$droplets$backups
## list()
##
## $droplets$droplets$snapshots
## list()
##
##
## $droplets$event_id
## NULL
##
##
## $event_id
## [1] 26714109

Then pipe it again to droplets_power_on()

droplets(1880805) %>%
  droplets_power_on
## $droplet_ids
## [1] 1880805
##
## $droplets
## $droplets$droplet_ids
## [1] 1880805
##
## $droplets$droplets
## $droplets$droplets$id
## [1] 1880805
##
## $droplets$droplets$name
## [1] "foo"
##
## $droplets$droplets$image_id
## [1] 3240036
##
## $droplets$droplets$size_id
## [1] 66
##
## $droplets$droplets$region_id
## [1] 3
##
## $droplets$droplets$backups_active
## [1] FALSE
##
## $droplets$droplets$ip_address
## [1] "162.243.152.56"
##
## $droplets$droplets$private_ip_address
## NULL
##
## $droplets$droplets$locked
## [1] FALSE
##
## $droplets$droplets$status
## [1] "off"
##
## $droplets$droplets$created_at
## [1] "2014-06-18T14:15:35Z"
##
## $droplets$droplets$backups
## list()
##
## $droplets$droplets$snapshots
## list()
##
##
## $droplets$event_id
## NULL
##
##
## $event_id
## [1] 26714152
Sys.sleep(6)
droplets(1880805)$droplets$status
## [1] "off"

Why not use more pipes?

droplets(1880805) %>%
  droplets_power_off %>%
  droplets_power_on %>%
  events

Last time I talked about installing R, RStudio, etc. on a droplet. I'm still working out bugs in that stuff, but do test out so it can get better faster. See do_install().