oai - an OAI-PMH client

oai is a general-purpose client for working with any ‘OAI-PMH’ service. The ‘OAI-PMH’ protocol is described at https://www.openarchives.org/OAI/openarchivesprotocol.html. The main functions follow the OAI-PMH verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets. The repo is at https://github.com/sckott/oai. I will be using this in a number of packages I maintain that use OAI-PMH data services. If you try it, let me know what you think. This package is heading to rOpenSci soon: https://github.com/ropensci/onboarding/issues/19. Here are a few usage examples: ...
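To make the verb list concrete, here is a minimal base R sketch of how an OAI-PMH request URL is put together: the verb goes in a `verb` query parameter, followed by any further arguments. The helper `oai_url()` and the endpoint `https://example.org/oai` are made up for illustration. They are not part of the oai package, whose own functions may work differently.

```r
# Hypothetical helper (NOT from the oai package): build an OAI-PMH request
# URL from a base URL, one of the six protocol verbs, and optional arguments.
oai_url <- function(base, verb, ...) {
  verbs <- c("GetRecord", "Identify", "ListIdentifiers",
             "ListMetadataFormats", "ListRecords", "ListSets")
  verb <- match.arg(verb, verbs)  # errors on anything that is not a known verb
  args <- c(list(verb = verb), list(...))
  # URL-encode each value so characters such as ":" are safe in a query string
  vals <- vapply(
    args,
    function(x) utils::URLencode(as.character(x), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(names(args), vals, sep = "=", collapse = "&"))
}

base <- "https://example.org/oai"  # placeholder endpoint (assumption)

oai_url(base, "Identify")
oai_url(base, "ListRecords", metadataPrefix = "oai_dc")
oai_url(base, "GetRecord",
        identifier = "oai:example.org:123", metadataPrefix = "oai_dc")
```

The sketch only builds URLs. Fetching and parsing the XML responses is the part a client package takes care of.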

September 11, 2015 · 3 min · Scott Chamberlain

fulltext - a package to help you mine text

Finally, we got fulltext up on CRAN - our first commit was in May last year. fulltext is a package to facilitate text mining. It focuses on open access journals. This package makes it easier to search for articles, download those articles in full text if available, convert PDF format to plain text, and extract text chunks for visualization/analysis. We are planning to add bits for analysis in future versions. We’ve been working on this package for a while now. It has a lot of moving parts and package dependencies, so it took a while to get a first usable version. ...

August 7, 2015 · 10 min · Scott Chamberlain

rnoaa - Weather data in R

NOAA provides a lot of weather data, across many different websites under different project names. The R package rnoaa accesses many of these, including:

- NOAA NCDC climate data, using the NCDC API version 2
- GHCND FTP data
- ISD FTP data
- Severe weather data - docs are at https://www.ncdc.noaa.gov/swdiws/
- Sea ice data
- NOAA buoy data
- Tornadoes! Data from the NOAA Storm Prediction Center
- HOMR - Historical Observing Metadata Repository - from NOAA NCDC
- Storm data - from the International Best Track Archive for Climate Stewardship (IBTrACS)

rnoaa used to provide access to ERDDAP servers, but a separate package rerddap focuses on just those data sources. ...

July 7, 2015 · 12 min · Scott Chamberlain

rerddap - General purpose R client for ERDDAP servers

ERDDAP is a data server that gives you a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. Besides its own RESTful interface, much of which is designed based on OPeNDAP, ERDDAP can act as an OPeNDAP server and as a WMS server for gridded data. ERDDAP is a powerful tool - in a world of heterogeneous data, it’s often hard to combine data and serve it through the same interface, with tools for querying/filtering/subsetting the data. That is exactly what ERDDAP does. Heterogeneous data sets often have some similarities, such as latitude/longitude data and usually a time component, but other variables vary widely. ...

June 24, 2015 · 8 min · Scott Chamberlain

iDigBio - a new data source in spocc

iDigBio, or Integrated Digitized Biocollections, collects and provides access to species occurrence data and associated metadata (e.g., images of specimens, when provided). They collect data from a lot of different providers. They have a nice web interface for searching; check out idigbio.org/portal/search. spocc is a package we’ve been working on at rOpenSci for a while now - it is a one-stop shop for retrieving species occurrence data. As new sources of species occurrence data come to our attention, and are available via a RESTful API, we incorporate them into spocc. ...

June 8, 2015 · 3 min · Scott Chamberlain

lawn - a new package to do geospatial analysis

lawn is an R wrapper for the JavaScript library turf.js for advanced geospatial analysis. In addition, we have a few functions to interface with the geojson-random JavaScript library. lawn includes traditional spatial operations, helper functions for creating GeoJSON data, and data classification and statistics tools. There is an additional helper function (see view()) in this package to help visualize data with interactive maps via the leaflet package (https://github.com/rstudio/leaflet). Note that leaflet is not required to install lawn - it’s in Suggests, not Imports or Depends. ...

May 18, 2015 · 5 min · Scott Chamberlain

openadds - open addresses client

openadds talks to Openaddresses.io. Here’s a rundown of what it does:

Install

    devtools::install_github("sckott/openadds")
    library("openadds")

List datasets

Scrapes links to datasets from the openaddresses site

    dat <- oa_list()
    dat[2:6]
    #> [1] "https://data.openaddresses.io.s3.amazonaws.com/20150511/au-tas-launceston.csv"
    #> [2] "https://s3.amazonaws.com/data.openaddresses.io/20141127/au-victoria.zip"
    #> [3] "https://data.openaddresses.io.s3.amazonaws.com/20150511/be-flanders.zip"
    #> [4] "https://data.openaddresses.io.s3.amazonaws.com/20150417/ca-ab-calgary.zip"
    #> [5] "https://data.openaddresses.io.s3.amazonaws.com/20150511/ca-ab-grande_prairie.zip"

Search for datasets

Uses oa_list() internally, then searches through columns requested.

    oa_search(country = "us", state = "ca")
    #> Source: local data frame [68 x 5]
    #>
    #>    country state             city  ext
    #> 1       us    ca san_mateo_county .zip
    #> 2       us    ca   alameda_county .zip
    #> 3       us    ca   alameda_county .zip
    #> 4       us    ca           amador .zip
    #> 5       us    ca           amador .zip
    #> 6       us    ca      bakersfield .zip
    #> 7       us    ca      bakersfield .zip
    #> 8       us    ca         berkeley .zip
    #> 9       us    ca         berkeley .zip
    #> 10      us    ca     butte_county .zip
    #> ..     ...   ...              ...  ...
    #> Variables not shown: url (chr)

Get data

Passing in a URL ...

May 18, 2015 · 5 min · Scott Chamberlain

geojsonio - a new package to do geojson things

geojsonio converts geographic data to GeoJSON and TopoJSON formats - though the focus is mostly on GeoJSON. For those not familiar with GeoJSON, it is a format for encoding a variety of geographic data structures. GeoJSON supports the following geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection. These geometry types are also found in well-known text (WKT), and have equivalents in R’s spatial classes. Read the spec for more detailed information. ...
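As a rough illustration of the simplest geometry type, the base R sketch below writes a single GeoJSON Feature with a Point geometry as a JSON string. The helper `point_feature()` is hypothetical and is not part of geojsonio. It does no escaping of the `name` value, so it is only suitable for simple strings. Note that GeoJSON coordinates are ordered longitude first, then latitude.

```r
# Hypothetical helper (NOT from geojsonio): a GeoJSON Feature holding one
# Point geometry, written out by hand as a JSON string.
# Assumes `name` contains no characters that need JSON escaping.
point_feature <- function(lon, lat, name) {
  sprintf(
    '{"type":"Feature","geometry":{"type":"Point","coordinates":[%s,%s]},"properties":{"name":"%s"}}',
    format(lon), format(lat), name
  )
}

# Coordinates go in as longitude, then latitude
cat(point_feature(-122.27, 37.87, "Berkeley"), "\n")
```

In practice a JSON library would handle escaping and nested structures. The hand-built string here is only meant to show the shape of the format.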

April 30, 2015 · 5 min · Scott Chamberlain

the new way - httsnap

Inspired by httpie, a Python command line client meant as a sort of drop-in replacement for curl, I am playing around with something similar-ish in R, at least in spirit. I started a little R pkg called httsnap with the following ideas:

- The web is increasingly a JSON world, so set content-type and accept headers to application/json by default
- The workflow follows logically, or at least should, from "hey, I got this URL", to "I need to add some options", to "execute request"
- Whenever possible, transform output to data.frames, facilitating downstream manipulation via dplyr, etc.
- Do GET requests by default. Specify a different type if you don’t want GET. Some functionality does GET by default, though in some cases you need to specify GET
- You can use non-standard evaluation to easily pass in query parameters without worrying about &’s, URL escaping, etc. (see Query())
- Same for body params (see Body())

Install

Install and load httsnap ...

April 29, 2015 · 4 min · Scott Chamberlain

Faster solr with csv

With the help of user input, I’ve tweaked solr just a bit to make things faster using default settings. I imagine the main interface for people using the solr R client is via solr_search(), which used to have wt=json by default. Changing this to wt=csv gives better performance. And it sorta makes sense to use csv: the point of using an R client is probably to get data into a data.frame eventually, so a format that is already tabular is a natural fit, especially if it’s faster too. ...
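The reasoning about CSV can be sketched in base R: a CSV response body is already tabular, so it goes into a data.frame in one step. The response text below is made up for illustration and stands in for what a Solr server might return when asked for `wt=csv`. It is not output from a real server, and this is not how solr_search() is implemented.

```r
# Made-up CSV body, standing in for a Solr response requested with wt=csv
# (an assumption for illustration, not real server output)
csv_body <- "id,title,year\n1,First article,2014\n2,Second article,2015\n"

# CSV is already tabular, so one call yields a data.frame
df <- read.csv(text = csv_body, stringsAsFactors = FALSE)
str(df)
```

A JSON response, by contrast, first has to be parsed into nested lists and then flattened into rows and columns, which is extra work on the R side.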

March 20, 2015 · 3 min · Scott Chamberlain