scrubr - clean species occurrence records

scrubr is an R library for cleaning species occurrence records. It's general purpose, and takes the following approach:

- We think a piping workflow (%>%) makes code easier to build up and easier to understand. However, you don't have to use pipes in this package.
- All inputs and outputs are data.frames, which makes the above point easier.
- Records trimmed off by the various filters are retained as attributes, so they can still be accessed for later inspection but don't get in the way of the data.frame that gets modified for downstream use.
- User interface vs. speed: this is the kind of package that can surely get faster. However, we're focusing on the UI first, and will make speed improvements down the road.
- Since occurrence record datasets should all have columns with lat/long information, we automatically look for those columns for you. If identified, we use them, but you can also supply lat/long column names manually.

We have many packages that fetch species occurrence records from GBIF, iNaturalist, VertNet, iDigBio, Ecoengine, and more. scrubr fills a crucial missing niche, as nearly all uses of occurrence data require cleaning of some kind. When using GBIF data via rgbif, that package has some utilities for cleaning data based on the issues returned with GBIF data; scrubr is a companion to do the rest of the cleaning. ...
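A minimal sketch of that piping workflow, using the coord_* filters and the sample_data_1 dataset that I believe ship with the package (treat names as illustrative if your version differs):

```r
library("scrubr")  # the pipe is re-exported; otherwise library("magrittr")

# dframe() wraps a data.frame so records removed by each filter
# are stashed as attributes rather than thrown away
df <- dframe(sample_data_1) %>%
  coord_incomplete() %>%  # drop records missing lat or long
  coord_impossible() %>%  # drop records with impossible coordinates
  coord_unlikely()        # drop records at unlikely points (e.g., 0,0)

# inspect what a filter removed, without it cluttering df itself
attr(df, "coord_incomplete")
```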

March 4, 2016 · 11 min · Scott Chamberlain

request - a high level HTTP client for R

request is a DSL for HTTP requests in R, inspired by the CLI tool httpie. It's built on httr. The following principles drove the design of this package:

- The web is increasingly a JSON world, so we assume application/json by default, but give back other types if not.
- The workflow follows logically (or at least should) from "hey, I got this URL" to "I need to add some options" to "execute request", and the functions support piping so you can execute them in that order.
- Whenever possible, we transform output to data.frames, facilitating downstream manipulation via dplyr, etc.
- We do GET requests by default. Specify a different type if you don't want GET. Given GET by default, this client is optimized for consumption of data rather than creating new things on servers.
- You can use non-standard evaluation to easily pass in query parameters without worrying about &'s, URL escaping, etc. (see api_query()). The same goes for body parameters (see api_body()).

The following is a brief demo of some of the package functionality: ...
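As a sketch of that URL-to-options-to-execution flow (the api() and api_path() helpers and the GitHub route here are as I recall them from the package, so treat as an assumption):

```r
library("request")

# start from a URL, build up the route, then add query parameters
# via NSE -- no manual &'s or URL escaping needed; the request runs
# as a GET when evaluated, and JSON comes back parsed
api("https://api.github.com/") %>%
  api_path(repos, ropensci, rgbif, issues) %>%
  api_query(page = 1, per_page = 2)
```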

January 5, 2016 · 5 min · Scott Chamberlain

binomen - Tools for slicing and dicing taxonomic names

The first version of binomen is now up on CRAN. It provides various taxonomic classes for defining a single taxon, multiple taxa, and a taxonomic data.frame. It is designed as a companion to taxize, where you can get taxonomic data on taxonomic names from the web. The (S3) classes:

- taxon
- taxonref
- taxonrefs
- binomial
- grouping (i.e., classification; we used a different term to avoid conflict with classification() in taxize)

For example, the binomial class is defined by a genus, epithet, authority, and optional full species name and canonical version. ...
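A quick sketch of constructing a binomial (the argument names below are my assumption from the class definition described above):

```r
library("binomen")

# a binomial: genus + specific epithet, with an optional authority
out <- binomial(genus = "Poa", epithet = "annua", authority = "L.")

# the object is list-like, so its slots are accessible directly
out$genus    # "Poa"
out$epithet  # "annua"
```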

December 8, 2015 · 5 min · Scott Chamberlain

Crossref programmatic clients

I gave two talks recently at the annual Crossref meeting, one of which was a somewhat technical overview of programmatic clients for the Crossref APIs. Check out the talk here. I talked about the motivation for working with Crossref data by writing code rather than going the GUI route, then went over the various clients with brief examples. We (rOpenSci) have been working on the R client rcrossref for a while now, but I'm also working on the Python and Ruby clients for Crossref. In addition, the Ruby client includes a CLI. The JavaScript client is developed independently by ScienceAI. ...

November 30, 2015 · 3 min · Scott Chamberlain

noaa - Integrated Surface Database data

I've recently made some improvements to the functions that work with ISD (Integrated Surface Database) data.

isd data

The isd() function now caches more intelligently. We now cache using .rds files via saveRDS/readRDS, whereas we used to use .csv files, which take up much more disk space and require worrying about data formats changing when reading data back into an R session. The downside is that you can't open a cached file directly in your favorite spreadsheet viewer, but you can do that manually after reading it into R. In addition, isd() now has a cleanup parameter: if TRUE, after downloading the data file from NOAA's FTP server and processing it, we delete the file. That's fine, since we have the cached, processed file; but you can choose not to clean up the original data files. Data processing in isd() is improved as well: we convert key variables to appropriate classes to make them more useful.

isd stations ...
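A rough sketch of the isd() workflow described above (the station identifiers are only illustrative; cleanup = TRUE is the behavior described in the post):

```r
library("rnoaa")

# one station-year of ISD data; the processed result is cached as
# an .rds file, and the raw file pulled from NOAA's FTP server is
# deleted after processing when cleanup = TRUE
res <- isd(usaf = "011490", wban = "99999", year = 1986, cleanup = TRUE)

# the stations function gives you identifiers to feed to isd()
stations <- isd_stations()
head(stations)
```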

October 21, 2015 · 4 min · Scott Chamberlain

Metrics for open source projects

Measuring use of open source software isn't always straightforward. The problem is especially acute for software targeted largely at academia, where usage is measured not just by software downloads but also by citations. Citations are a well-known pain point because the citation graph is privately held behind iron doors (e.g., Scopus, Google Scholar). New ventures aim to open up citation data, but of course it's an immense amount of work, and so it does not come quickly. ...

October 19, 2015 · 5 min · Scott Chamberlain

analogsea - an R client for the Digital Ocean API

analogsea is now on CRAN. We started developing the package back in May 2014, but are just now getting the first version on CRAN. It's a collaboration with Hadley Wickham and Winston Chang. Most of the analogsea package is for interacting with the Digital Ocean API, including:

- Manage domains
- Manage SSH keys
- Get actions
- Manage images
- Manage droplets (servers)

A number of convenience functions are included for doing tasks (e.g., resizing a droplet) that aren't supported by Digital Ocean's API out of the box (i.e., there's no API route for them). ...
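A hedged sketch of spinning up and tearing down a droplet (the function names are my best recollection of the analogsea API, and the size/region/image slugs are just examples; assumes a Digital Ocean personal access token in the DO_PAT environment variable):

```r
library("analogsea")

# create a small droplet
d <- droplet_create(name = "demo", size = "512mb",
                    region = "sfo1", image = "ubuntu-14-04-x64")

droplets()         # list all droplets on the account
droplet_delete(d)  # destroy the droplet when you're done
```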

October 2, 2015 · 2 min · Scott Chamberlain

oai - an OAI-PMH client

oai is a general purpose client to work with any OAI-PMH service. The OAI-PMH protocol is described at https://www.openarchives.org/OAI/openarchivesprotocol.html. The main functions follow the OAI-PMH verbs:

- GetRecord
- Identify
- ListIdentifiers
- ListMetadataFormats
- ListRecords
- ListSets

The repo is at https://github.com/sckott/oai. I will be using this in a number of packages I maintain that use OAI-PMH data services. If you try it, let me know what you think. This package is heading to rOpenSci soon: https://github.com/ropensci/onboarding/issues/19. Here are a few usage examples: ...
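A small sketch mapping functions to verbs (the DataCite endpoint and the exact function names are my assumptions about the package):

```r
library("oai")

url <- "http://oai.datacite.org/oai"

id(url)                    # the Identify verb
list_metadataformats(url)  # ListMetadataFormats

# ListRecords over a short date window
recs <- list_records(url, from = "2015-09-01", until = "2015-09-02")
```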

September 11, 2015 · 3 min · Scott Chamberlain

fulltext - a package to help you mine text

Finally, we got fulltext up on CRAN; our first commit was in May of last year. fulltext is a package to facilitate text mining. It focuses on open access journals. This package makes it easier to search for articles, download those articles in full text if available, convert PDF format to plain text, and extract text chunks for visualization/analysis. We are planning to add pieces for analysis in future versions. We've been working on this package for a while now. It has a lot of moving parts and package dependencies, so it took a while to get a first usable version. ...
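A minimal search-then-fetch sketch (ft_search()/ft_get() as I recall them; the PLOS source and the shape of the result object are assumptions):

```r
library("fulltext")

# search PLOS for a term, then fetch full text for the DOIs found
res <- ft_search(query = "ecology", from = "plos")
dois <- res$plos$data$id
out <- ft_get(dois, from = "plos")
```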

August 7, 2015 · 10 min · Scott Chamberlain

rnoaa - Weather data in R

NOAA provides a lot of weather data, across many different websites under different project names. The R package rnoaa accesses many of these, including:

- NOAA NCDC climate data, using the NCDC API version 2
- GHCND FTP data
- ISD FTP data
- Severe weather data (docs are at https://www.ncdc.noaa.gov/swdiws/)
- Sea ice data
- NOAA buoy data
- Tornadoes! Data from the NOAA Storm Prediction Center
- HOMR (Historical Observing Metadata Repository), from NOAA NCDC
- Storm data, from the International Best Track Archive for Climate Stewardship (IBTrACS)

rnoaa used to provide access to ERDDAP servers, but a separate package, rerddap, now focuses on just those data sources. ...
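For a taste, a sketch of hitting the NCDC API v2 through ncdc() (the station and datatype IDs are illustrative; assumes an API key in the NOAA_KEY environment variable):

```r
library("rnoaa")

# daily precipitation for one GHCND station over a few months
out <- ncdc(datasetid = "GHCND", stationid = "GHCND:USW00014895",
            datatypeid = "PRCP", startdate = "2010-05-01",
            enddate = "2010-10-31", token = Sys.getenv("NOAA_KEY"))
head(out$data)
```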

July 7, 2015 · 12 min · Scott Chamberlain