Marine Regions data in R

Back in November last year I was at a hackathon in Belgium focused on Ocean Biogeographic Information System (OBIS) data. One project idea was to make it easier to get data based on one or more marine regions. I was told that Marineregions.org is a common source of shapefiles for the various regions people then do other work with.

During the event I started a package, mregions. Here I’ll show how to get different marine regions, pass those outputs directly to other packages to query for OBIS species occurrence data, and then make maps.

We’ll use WKT (Well-Known Text) to define spatial geometries in this post. If you don’t know what it is, Wikipedia to the rescue: https://en.wikipedia.org/wiki/Well-known_text
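
For example, a simple polygon in WKT looks like this:

POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))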

A few notes before we begin:

  • mregions will be on CRAN soon; install from GitHub for now
  • the master version of robis lives at iobis/robis, but I use a slightly changed version in my fork :)

Install

devtools::install_github(c("sckott/mregions", "sckott/robis"))
install.packages("leaflet")
library("mregions")

Get list of place types

res <- place_types()
head(res$type)
#> [1] "Town"                      "Arrondissement"
#> [3] "Department"                "Province (administrative)"
#> [5] "Country"                   "Continent"

Get Marineregions records by place type

res <- records_by_type(type = "EEZ")
head(res)
#>   MRGID                                            gazetteerSource
#> 1  3293 Maritime Boundaries Geodatabase, Flanders Marine Institute
#> 2  5668 Maritime Boundaries Geodatabase, Flanders Marine Institute
#> 3  5669 Maritime Boundaries Geodatabase, Flanders Marine Institute
#> 4  5670 Maritime Boundaries Geodatabase, Flanders Marine Institute
#> 5  5672 Maritime Boundaries Geodatabase, Flanders Marine Institute
#> 6  5673 Maritime Boundaries Geodatabase, Flanders Marine Institute
#>   placeType latitude longitude minLatitude minLongitude maxLatitude
#> 1       EEZ 51.46483  2.704458    51.09111     2.238118    51.87000
#> 2       EEZ 53.61508  4.190675    51.26203     2.539443    55.76500
#> 3       EEZ 54.55970  8.389231    53.24281     3.349999    55.91928
#> 4       EEZ 40.87030 19.147094    39.63863    18.461940    41.86124
#> 5       EEZ 42.94272 29.219062    41.97820    27.449580    43.74779
#> 6       EEZ 43.42847 15.650844    41.62201    13.001390    45.59079
#>   maxLongitude precision            preferredGazetteerName
#> 1     3.364907  58302.49   Belgian Exclusive Economic Zone
#> 2     7.208364 294046.10     Dutch Exclusive Economic Zone
#> 3    14.750000 395845.50    German Exclusive Economic Zone
#> 4    20.010030 139751.70  Albanian Exclusive Economic Zone
#> 5    31.345280 186792.50 Bulgarian Exclusive Economic Zone
#> 6    18.552360 313990.30  Croatian Exclusive Economic Zone
#>   preferredGazetteerNameLang   status accepted
#> 1                    English standard     3293
#> 2                    English standard     5668
#> 3                    English standard     5669
#> 4                    English standard     5670
#> 5                    English standard     5672
#> 6                    English standard     5673
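
The result is a regular data.frame, so normal subsetting applies. For example, to pull out one EEZ by name (a quick sketch using the columns shown above):

res[res$preferredGazetteerName == "Belgian Exclusive Economic Zone",
    c("MRGID", "preferredGazetteerName", "latitude", "longitude")]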

Get and search region names

res <- region_names()
region_names_search(query = "IHO")
#> [[1]]
#> [[1]]$name
#> [1] "MarineRegions:iho"
#>
#> [[1]]$title
#> [1] "IHO Sea Areas"
#>
#> [[1]]$name_first
#> [1] "MarineRegions"
#>
#> [[1]]$name_second
#> [1] "iho"
#>
#>
#> [[2]]
#> [[2]]$name
#> [1] "MarineRegions:iho_quadrants_20150810"
#>
#> [[2]]$title
#> [1] "IHO quadrants (20150810)"
#>
#> [[2]]$name_first
#> [1] "MarineRegions"
#>
#> [[2]]$name_second
#> [1] "iho_quadrants_20150810"
#>
#>
#> [[3]]
#> [[3]]$name
#> [1] "World:iosregions"
#>
#> [[3]]$title
#> [1] "IOS Regions"
#>
#> [[3]]$name_first
#> [1] "World"
#>
#> [[3]]$name_second
#> [1] "iosregions"
#>
#>
#> [[4]]
#> [[4]]$name
#> [1] "MarineRegions:eez_iho_union_v2"
#>
#> [[4]]$title
#> [1] "Marineregions: the intersect of the Exclusive Economic Zones and IHO areas"
#>
#> [[4]]$name_first
#> [1] "MarineRegions"
#>
#> [[4]]$name_second
#> [1] "eez"
#>
#>
#> [[5]]
#> [[5]]$name
#> [1] "Belgium:vl_venivon"
#>
#> [[5]]$title
#> [1] "VEN (Flanders Ecological Network) and IVON (Integral Interrelated and Supporting Network)  areas in Flanders"
#>
#> [[5]]$name_first
#> [1] "Belgium"
#>
#> [[5]]$name_second
#> [1] "vl_venivon"

Get a region - geojson

res <- region_geojson(name = "Turkmen Exclusive Economic Zone")
class(res)
#> [1] "mr_geojson"
names(res)
#> [1] "type"          "totalFeatures" "features"      "crs"
#> [5] "bbox"

Get a region - shp

res <- region_shp(name = "Belgian Exclusive Economic Zone")
class(res)
#> [1] "SpatialPolygonsDataFrame"
#> attr(,"package")
#> [1] "sp"

Get OBIS EEZ ID

res <- region_names()
res <- Filter(function(x) grepl("eez", x$name, ignore.case = TRUE), res)
obis_eez_id(res[[2]]$title)
#> [1] 84

Convert to WKT

You can convert either geojson or shp output to WKT. Here, geojson:

res <- region_geojson(key = "MarineRegions:eez_33176")
as_wkt(res, fmt = 5)
#> [1] "MULTIPOLYGON (((41.573732 -1.659444, 45.891882 ... cutoff

Get regions, then OBIS data

Using Well-Known Text

Both shp and geojson data returned from region_shp() and region_geojson(), respectively, can be passed to as_wkt() to get WKT.

library('robis')
shp <- region_shp(name = "Belgian Exclusive Economic Zone")
wkt <- as_wkt(shp)
xx <- occurrence("Abra alba", geometry = wkt)
#> Retrieved 695 records of 695 (100%)
xx <- xx[, c('scientificName', 'decimalLongitude', 'decimalLatitude')]
names(xx) <- c('scientificName', 'longitude', 'latitude')

Plot

library('leaflet')
leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = xx) %>%
  addPolygons(data = shp)

(map: Abra alba occurrence points and the Belgian EEZ polygon)

Using EEZ ID

EEZ stands for Exclusive Economic Zone.

library('robis')
(eez <- obis_eez_id("Belgian Exclusive Economic Zone"))
#> [1] 59

You currently can’t search for OBIS occurrences by EEZ ID, but hopefully soon…

Dealing with bigger WKT

What if your WKT string is super long? It’s often a problem: some online species occurrence databases that accept WKT for geometry queries bork if the URL gets too long (about 8000 characters, including the rest of the URL). One way to deal with this is to reduce detail, i.e., simplify the geometry.
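
You can check whether your WKT is likely to hit that limit before searching; a quick sketch:

wkt <- as_wkt(region_shp(name = "Dutch Exclusive Economic Zone"))
nchar(wkt)  # anywhere near 8000? simplify the geometry first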

devtools::install_github("ateucher/rmapshaper")

Using rmapshaper we can simplify a spatial object, then search with that.

shp <- region_shp(name = "Dutch Exclusive Economic Zone")

Visualize

leaflet() %>%
  addTiles() %>%
  addPolygons(data = shp)

(map: the full-detail Dutch EEZ polygon)

Simplify

library("rmapshaper")
shp <- ms_simplify(shp)
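
By default ms_simplify() retains only a small fraction (5%) of the original vertices; if you need more or less detail, tune the keep argument, e.g.:

ms_simplify(shp, keep = 0.1)  # retain 10% of the vertices instead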

It’s simplified:

leaflet() %>%
  addTiles() %>%
  addPolygons(data = shp)

(map: the simplified Dutch EEZ polygon)

Convert to WKT

wkt <- as_wkt(shp)

OBIS data

Search OBIS

library("robis")
dat <- occurrence_single(geometry = wkt, limit = 500, fields = c("species", "decimalLongitude", "decimalLatitude"))
head(dat)
#>               species decimalLongitude decimalLatitude
#> 1  Temora longicornis         3.423300        55.39170
#> 2  Temora longicornis         3.518300        55.39170
#> 3 Stragularia clavata         4.190675        53.61508
#> 4                <NA>         4.189400        53.55727
#> 5   Stylonema alsidii         4.190675        53.61508
#> 6                <NA>         4.318000        53.30720
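
From there it’s ordinary data wrangling; for instance, a quick sketch counting the distinct species returned:

length(unique(na.omit(dat$species)))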

atomize - make new packages from other packages

We (rOpenSci) just held our 3rd annual rOpenSci unconference (http://unconf16.ropensci.org/) in San Francisco. There were a lot of ideas, and lots of awesome projects from awesome people came out of the two-day event.

One weird idea I had comes from looking at the Node world, where there are lots of tiny packages, in contrast to the often larger packages we have in the R world. One reason packages stay tiny in Node is that a library running in the browser should be small for faster load times (especially on mobile).

So the idea is: what if we could split all the functions in a package, or any particular functions of your choice, out into new packages, carrying along their internal functions and dependencies? And automatically, not manually.

So what are the use cases? I can’t imagine this being used to create stable packages to disperse to the world on CRAN, but it could be really useful for development purposes, or for R users/analysts that want lighter weight dependencies (e.g., a package with just the one function needed from a larger package).

This approach of course has drawbacks. The newly created package is now broken apart from the original; however, because it’s automated, you can just re-create it.

Another pain point would surely be with packages that have C/C++ code in them.

The package: atomize.

The package was made possible by the awesome functionMap package by Gábor Csárdi, and the more well-known devtools.

Expect bugs; the package has no tests. Sorry :(

Installation

devtools::install_github("ropenscilabs/atomize")
library("atomize")

usage

atomize a function into a separate package

You can select one function, or many. Here I’m using another package I maintain, rredlist, a package to interact with the IUCN Red List API.

In this example, I want a new package called foobar with just the function rl_citation(). The function atomize::atomizer() takes the path for the package to extract from, then a path for the new package, then the function names.

atomizer(path_ref = "../rredlist", path_new = "../foobar", funcs = "rl_citation")

This creates a new package in the path_new directory.
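
The new directory holds a standard package skeleton (exact contents depend on atomize, but at minimum something like):

foobar/
  DESCRIPTION
  NAMESPACE
  R/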

install

Now we need to install the new package, which is easy with devtools::install():

devtools::install("../foobar")

load

Then load the new package

library("foobar")

call function

Now call the function in the new package

foobar::rl_citation()
#> [1] "IUCN 2015. IUCN Red List of Threatened Species. Version 2015-4 <www.iucnredlist.org>"

It’s identical to the same function in the old package:

identical(rredlist::rl_citation(), foobar::rl_citation())
#> [1] TRUE

GenBank IDs API - get, match, swap id types

GenBank entries have two types of identifiers: accession numbers and GI identifiers (see this page for why there are two types). Actually, recent news from NCBI is that GI identifiers will be phased out by September this year, which affects what I’ll talk about below.

There are a lot of sequences in GenBank. Sometimes you have identifiers and want to check whether they exist in GenBank, or want to get one type from the other (accession from GI, or vice versa; though the GI phase-out will make that use case moot), or just want a bunch of identifiers, perhaps for software testing.

Currently, the ENTREZ web services aren’t super performant or easy to use for these use cases. Thus, I’ve built a RESTful API for them, called gbids.

gbids on GitHub: sckott/gbids

Here’s the tech stack:

  • API: Ruby/Sinatra
  • Storage: MySQL
  • Caching: Redis
    • each key cached for 3 hours
  • Server: Caddy
    • https
  • Authentication: none

A cron job will soon update the data when a new dump becomes available each Sunday, but for now we’re about a month behind the current dump of identifiers. If usage of the API picks up, I’ll know it’s worth maintaining and will make sure the data stays up to date.

The base url is https://gbids.xyz.

Here’s a rundown of the API routes:

  • / re-routes to /heartbeat
  • /heartbeat - list routes
  • /acc - GET - list accession ids
  • /acc/:id,:id,... - GET - submit many accession numbers, get back boolean (match or no match)
  • /acc - POST
  • /gi - GET - list gi numbers
  • /gi/:id,:id,... - GET - submit many gi numbers, get back boolean (match or no match)
  • /gi - POST
  • /acc2gi/:id,:id,... - GET - get gi numbers from accession numbers
  • /acc2gi - POST
  • /gi2acc/:id,:id,... - GET - get accession numbers from gi numbers
  • /gi2acc - POST

Of course after GI identifiers are phased out, all gi routes will be removed.

The API docs are at recology.info/gbidsdocs - let me know if you have any feedback on those.

I also have a status page up at recology.info/gbidsstatus - it’s not automated, I have to update the status manually, but I do update that page whenever I’m doing maintenance and the API will be down, or if it goes down due to any other reason.

examples

Request to list accession identifiers, limited to 5:

curl 'https://gbids.xyz/acc?limit=5' | jq .
{
  "matched": 692006925,
  "returned": 5,
  "data": [
    "A00002",
    "A00003",
    "X17276",
    "X60065",
    "CAA42669"
  ],
  "error": null
}

Request to match accession identifiers; some exist, some do not:

curl 'https://gbids.xyz/acc/AACY024124486,AACY024124483,asdfd,asdf,AACY024124476' | jq .
{
  "matched": 3,
  "returned": 5,
  "data": {
    "AACY024124486": true,
    "AACY024124483": true,
    "asdfd": false,
    "asdf": false,
    "AACY024124476": true
  },
  "error": null
}
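
The same requests work from R; a minimal sketch using httr and jsonlite (just one way to call the API, nothing gbids-specific):

library("httr")
library("jsonlite")
res <- GET("https://gbids.xyz/acc", query = list(limit = 5))
stop_for_status(res)
fromJSON(content(res, as = "text", encoding = "UTF-8"))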

to do

I think it’d be worth adding routes for getting accession numbers from taxonomy IDs and vice versa, since that’s currently not very easy. We could use the dumped accession2taxid data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/

feedback!

If you think this could be useful, do try it out and send any feedback. I watch the API logs, so I’m hoping to see an increase in usage so I know people find it useful.

Since servers aren’t free and costs add up, I’ll discontinue the service at some point if I don’t see usage pick up. So do use it!

And if anyone can offer free servers, I’d gladly take advantage of that. I’ve applied for an AWS research grant, but won’t hear back for a few months.