fulltext - a package to help you mine text

Finally, we got fulltext up on CRAN - our first commit was May last year. fulltext is a package to facilitate text mining, focusing on open access journals. It makes it easier to search for articles, download those articles in full text when available, convert PDFs to plain text, and extract text chunks for visualization/analysis. We plan to add pieces for analysis in future versions. We've been working on this package for a while now; it has a lot of moving parts and package dependencies, so it took a while to get a first usable version.

The tasks facilitated by fulltext in bullet form:

  • Search - search for articles
  • Retrieve - get full text
  • Convert - convert from format X to Y
  • Text - if needed, get text from pdfs/etc.
  • Extract - pull out the bits of articles that you want

I won't be surprised if users uncover a lot of bugs in this package, given the huge number of publishers/journals users will want to get literature data from and the surely wide diversity of use cases. But I thought it was important to get a first version out to gather feedback on the user interface and collect use cases.

We hope that this package can help bring text mining to the masses - making it easy for anyone to do, not just text-mining experts.

If you have any feedback, please get in touch via the issue tracker for fulltext at https://github.com/ropensci/fulltext/issues. If you have thoughts on use cases, the rOpenSci discussion forum is a good place to share them.

Let's kick the tires, shall we?

Install

Binaries may not be on your CRAN mirror yet - they weren't as of the morning (PDT) of 2015-08-07.

install.packages("fulltext")
# if binaries not avail. yet on your favorite CRAN mirror
install.packages("https://cran.rstudio.com/src/contrib/fulltext_0.1.0.tar.gz", repos = NULL, type = "source")

Or install development version from GitHub

devtools::install_github("ropensci/fulltext")

Load fulltext

library("fulltext")

Search for articles

Currently, there are hooks for searching for articles from PLOS, BMC, Crossref, Entrez, arXiv, and BioRxiv. We'll add more in the future, but that does cover a lot of articles, especially given inclusion of Crossref (which mints most DOIs) and Entrez (which houses PMC and Pubmed).

An example: Search for the term ecology in PLOS journals.

(res1 <- ft_search(query = 'ecology', from = 'plos'))
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 28589; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0] 
#> Returned:
#>   [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]

Each publisher/search-engine has a slot with metadata and data, saying how many articles were found and how many were returned. We can dig into what PLOS gave us:

res1$plos
#> Query: [ecology] 
#> Records found, returned: [28589, 10] 
#> License: [CC-BY] 
#>                                                         id
#> 1                             10.1371/journal.pone.0059813
#> 2                             10.1371/journal.pone.0001248
#> 3  10.1371/annotation/69333ae7-757a-4651-831c-f28c5eb02120
#> 4                             10.1371/journal.pone.0080763
#> 5                             10.1371/journal.pone.0102437
#> 6                             10.1371/journal.pone.0017342
#> 7                             10.1371/journal.pone.0091497
#> 8                             10.1371/journal.pone.0092931
#> 9  10.1371/annotation/28ac6052-4f87-4b88-a817-0cd5743e83d6
#> 10                            10.1371/journal.pcbi.1003594

For each data source you can pass in additional options (essentially, the query parameters of the underlying functions that hit each service). Here, we modify our PLOS search by requesting a particular set of fields with the fl parameter (PLOS uses a Solr-backed search engine, and fl is short for fields in Solr land):

ft_search(query = 'ecology', from = 'plos', plosopts = list(
   fl = c('id','author','eissn','journal','counter_total_all','alm_twitterCount')))
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 28589; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0] 
#> Returned:
#>   [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]

Note that PLOS is a bit unique in allowing you to request specific parts of articles. Other sources in ft_search() don't let you do that.

Get full text

After you've found the set of articles you want full text for, we can use the results from ft_search() to grab full text. ft_get() accepts a character vector or list of DOIs (or PMC IDs if fetching from Entrez), or the output of ft_search().
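
You can also pass DOIs directly, telling ft_get() where to fetch from - a quick sketch using one of the DOIs from the search above:

ft_get('10.1371/journal.pone.0059813', from = 'plos')

Here, though, we pass the ft_search() output itself: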

(out <- ft_get(res1))
#> [Docs] 8 
#> [Source] R session  
#> [IDs] 10.1371/journal.pone.0059813 10.1371/journal.pone.0001248
#>      10.1371/journal.pone.0080763 10.1371/journal.pone.0102437
#>      10.1371/journal.pone.0017342 10.1371/journal.pone.0091497
#>      10.1371/journal.pone.0092931 10.1371/journal.pcbi.1003594 ...

We got eight full-text articles in the result. We didn't get 10, even though 10 were returned from ft_search(), because PLOS often returns records for annotations (that is, comments on articles), which we automatically sieve out within ft_get().
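
ft_get() handles that sieving for you, but if you wanted to drop annotation records yourself from the search results, a quick base R filter along these lines would work (a sketch, assuming the ids live in res1$plos$data$id as the print above suggests):

ids <- res1$plos$data$id
ids[!grepl("annotation", ids)]  # keep only regular article DOIs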

Dig in to the PLOS data

out$plos
#> $found
#> [1] 8
#> 
#> $dois
#> [1] "10.1371/journal.pone.0059813" "10.1371/journal.pone.0001248"
#> [3] "10.1371/journal.pone.0080763" "10.1371/journal.pone.0102437"
#> [5] "10.1371/journal.pone.0017342" "10.1371/journal.pone.0091497"
#> [7] "10.1371/journal.pone.0092931" "10.1371/journal.pcbi.1003594"
#> 
#> $data
#> $data$backend
#> NULL
#> 
#> $data$path
#> [1] "session"
#> 
#> $data$data
#> 8 full-text articles retrieved 
#> Min. Length: 3828 - Max. Length: 104702 
#> DOIs: 10.1371/journal.pone.0059813 10.1371/journal.pone.0001248
#>   10.1371/journal.pone.0080763 10.1371/journal.pone.0102437
#>   10.1371/journal.pone.0017342 10.1371/journal.pone.0091497
#>   10.1371/journal.pone.0092931 10.1371/journal.pcbi.1003594 ... 
#> 
#> NOTE: extract xml strings like output['<doi>']
#> 
#> $opts
#> $opts$doi
#> [1] "10.1371/journal.pone.0059813" "10.1371/journal.pone.0001248"
#> [3] "10.1371/journal.pone.0080763" "10.1371/journal.pone.0102437"
#> [5] "10.1371/journal.pone.0017342" "10.1371/journal.pone.0091497"
#> [7] "10.1371/journal.pone.0092931" "10.1371/journal.pcbi.1003594"
#> 
#> $opts$callopts
#> list()

Dig in further to get to one of the articles in XML format

library("xml2")
xml2::read_xml(out$plos$data$data$`10.1371/journal.pone.0059813`)
#> {xml_document}
#> <article>
#> [1] <front>\n<journal-meta>\n<journal-id journal-id-type="nlm-ta">PLoS O ...
#> [2] <body>\n  <sec id="s1">\n<title>Introduction</title>\n<p>Ecologists  ...
#> [3] <back>\n<ack>\n<p>Curtis Flather, Mark Burgman, Leon Blaustein, Yaac ...

Now that you have the XML, you can dig into whatever you like, e.g., using xml2 or rvest.
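
For example, a minimal sketch with xml2 - the XPath expressions assume the JATS-style layout PLOS uses, so adjust as needed:

doc <- xml2::read_xml(out$plos$data$data$`10.1371/journal.pone.0059813`)
# article title from the front matter
xml2::xml_text(xml2::xml_find_all(doc, "//front//article-title"))
# all paragraph text from the body
xml2::xml_text(xml2::xml_find_all(doc, "//body//p"))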

Extract text from pdfs

Ideally for text mining you have access to XML or other text-based formats. However, sometimes you only have access to PDFs. In that case you want to extract text from the PDFs, and fulltext can help with that.

You can extract text from any PDF given a file path, like:

path <- system.file("examples", "example1.pdf", package = "fulltext")
ft_extract(path)
#> <document>/Library/Frameworks/R.framework/Versions/3.2/Resources/library/fulltext/examples/example1.pdf
#>   Pages: 18
#>   Title: Suffering and mental health among older people living in nursing homes---a mixed-methods study
#>   Producer: pdfTeX-1.40.10
#>   Creation date: 2015-07-17

Let's search for articles from arXiv, a preprint service. Here, get the PDF for an article with ID cond-mat/9309029:

res <- ft_get('cond-mat/9309029', from = "arxiv")
res2 <- ft_extract(res)
res2$arxiv$data
#> $backend
#> NULL
#> 
#> $path
#> $path$`cond-mat/9309029`
#> [1] "~/.fulltext/cond-mat_9309029.pdf"
#> 
#> 
#> $data
#> $data[[1]]
#> <document>/Users/sacmac/.fulltext/cond-mat_9309029.pdf
#>   Pages: 14
#>   Title: arXiv:cond-mat/9309029v8  26 Jan 1994
#>   Producer: GPL Ghostscript SVN PRE-RELEASE 8.62
#>   Creation date: 2008-02-06

And a short snippet of the full text

res2$arxiv$data$data[[1]]$data
#> "arXiv:cond-mat/9309029v8 26 Jan 1994, , FERMILAB-PUB-93/15-T March 1993, Revised:
#> January 1994, The Thermodynamics and Economics of Waste, Dallas C. Kennedy, Research
#> Associate, Fermi National Accelerator Laboratory, P.O. Box 500 MS106, Batavia, Illinois
#> 60510 USA, Abstract, The increasingly relevant problem of natural resource use and
#> waste production, disposal, and reuse is examined from several viewpoints: economic,
#> technical, and thermodynamic. Alternative economies are studied, with emphasis on
#> recycling of waste to close the natural resource cycle. The physical nature of human
#> economies and constraints on recycling and energy efficiency are stated in terms
#> ..."

Extract text chunks

We have a few functions to help you pull out certain parts of an article. For example, perhaps you want to get just the authors from your articles, or just the abstracts.

Here, we'll search for some PLOS articles, then get their full text, then extract various parts of each article with chunks().

res <- ft_search(query = "ecology", from = "plos")
(x <- ft_get(res))
#> [Docs] 8 
#> [Source] R session  
#> [IDs] 10.1371/journal.pone.0059813 10.1371/journal.pone.0001248
#>      10.1371/journal.pone.0080763 10.1371/journal.pone.0102437
#>      10.1371/journal.pone.0017342 10.1371/journal.pone.0091497
#>      10.1371/journal.pone.0092931 10.1371/journal.pcbi.1003594 ...

Extract DOIs

x %>% chunks("doi")
#> $plos
#> $plos$`10.1371/journal.pone.0059813`
#> $plos$`10.1371/journal.pone.0059813`$doi
#> [1] "10.1371/journal.pone.0059813"
#> 
#> 
#> $plos$`10.1371/journal.pone.0001248`
#> $plos$`10.1371/journal.pone.0001248`$doi
#> [1] "10.1371/journal.pone.0001248"
#> 
#> 
#> $plos$`10.1371/journal.pone.0080763`
#> $plos$`10.1371/journal.pone.0080763`$doi
#> [1] "10.1371/journal.pone.0080763"
#> 
#> 
#> $plos$`10.1371/journal.pone.0102437`
#> $plos$`10.1371/journal.pone.0102437`$doi
#> [1] "10.1371/journal.pone.0102437"
#> 
#> 
#> $plos$`10.1371/journal.pone.0017342`
#> $plos$`10.1371/journal.pone.0017342`$doi
#> [1] "10.1371/journal.pone.0017342"
#> 
#> 
#> $plos$`10.1371/journal.pone.0091497`
#> $plos$`10.1371/journal.pone.0091497`$doi
#> [1] "10.1371/journal.pone.0091497"
#> 
#> 
#> $plos$`10.1371/journal.pone.0092931`
#> $plos$`10.1371/journal.pone.0092931`$doi
#> [1] "10.1371/journal.pone.0092931"
#> 
#> 
#> $plos$`10.1371/journal.pcbi.1003594`
#> $plos$`10.1371/journal.pcbi.1003594`$doi
#> [1] "10.1371/journal.pcbi.1003594"

Extract DOIs and categories

x %>% chunks(c("doi","categories"))
#> $plos
#> $plos$`10.1371/journal.pone.0059813`
#> $plos$`10.1371/journal.pone.0059813`$doi
#> [1] "10.1371/journal.pone.0059813"
#> 
#> $plos$`10.1371/journal.pone.0059813`$categories
#>  [1] "Research Article"                 "Biology"                         
#>  [3] "Ecology"                          "Community ecology"               
#>  [5] "Species interactions"             "Science policy"                  
#>  [7] "Research assessment"              "Research monitoring"             
#>  [9] "Research funding"                 "Government funding of science"   
#> [11] "Research laboratories"            "Science policy and economics"    
#> [13] "Science and technology workforce" "Careers in research"             
#> [15] "Social and behavioral sciences"   "Sociology"                       
#> [17] "Sociology of knowledge"          
#> 
#> 
#> $plos$`10.1371/journal.pone.0001248`
#> $plos$`10.1371/journal.pone.0001248`$doi
#> [1] "10.1371/journal.pone.0001248"
#> 
#> $plos$`10.1371/journal.pone.0001248`$categories
#> [1] "Research Article"             "Ecology"                     
#> [3] "Ecology/Ecosystem Ecology"    "Ecology/Evolutionary Ecology"
#> [5] "Ecology/Theoretical Ecology" 
#> 
#> 
#> $plos$`10.1371/journal.pone.0080763`
#> $plos$`10.1371/journal.pone.0080763`$doi
#> [1] "10.1371/journal.pone.0080763"
#> 
#> $plos$`10.1371/journal.pone.0080763`$categories
#>  [1] "Research Article"     "Biology"              "Ecology"             
#>  [4] "Autecology"           "Behavioral ecology"   "Community ecology"   
#>  [7] "Evolutionary ecology" "Population ecology"   "Evolutionary biology"
#> [10] "Behavioral ecology"   "Evolutionary ecology" "Population biology"  
#> [13] "Population ecology"  
#> 
#> 
#> $plos$`10.1371/journal.pone.0102437`
#> $plos$`10.1371/journal.pone.0102437`$doi
#> [1] "10.1371/journal.pone.0102437"
#> 
#> $plos$`10.1371/journal.pone.0102437`$categories
#>  [1] "Research Article"                  
#>  [2] "Biology and life sciences"         
#>  [3] "Biogeography"                      
#>  [4] "Ecology"                           
#>  [5] "Ecosystems"                        
#>  [6] "Ecosystem engineering"             
#>  [7] "Ecosystem functioning"             
#>  [8] "Industrial ecology"                
#>  [9] "Spatial and landscape ecology"     
#> [10] "Urban ecology"                     
#> [11] "Computer and information sciences" 
#> [12] "Geoinformatics"                    
#> [13] "Spatial analysis"                  
#> [14] "Earth sciences"                    
#> [15] "Geography"                         
#> [16] "Human geography"                   
#> [17] "Cultural geography"                
#> [18] "Social geography"                  
#> [19] "Ecology and environmental sciences"
#> [20] "Conservation science"              
#> [21] "Environmental protection"          
#> [22] "Nature-society interactions"       
#> 
#> 
#> $plos$`10.1371/journal.pone.0017342`
#> $plos$`10.1371/journal.pone.0017342`$doi
#> [1] "10.1371/journal.pone.0017342"
#> 
#> $plos$`10.1371/journal.pone.0017342`$categories
#>  [1] "Research Article"     "Biology"              "Ecology"             
#>  [4] "Community ecology"    "Community assembly"   "Community structure" 
#>  [7] "Niche construction"   "Ecological metrics"   "Species diversity"   
#> [10] "Species richness"     "Biodiversity"         "Biogeography"        
#> [13] "Population ecology"   "Mathematics"          "Statistics"          
#> [16] "Biostatistics"        "Statistical theories" "Ecology"             
#> [19] "Mathematics"         
#> 
#> 
#> $plos$`10.1371/journal.pone.0091497`
#> $plos$`10.1371/journal.pone.0091497`$doi
#> [1] "10.1371/journal.pone.0091497"
#> 
#> $plos$`10.1371/journal.pone.0091497`$categories
#> [1] "Correction"
#> 
#> 
#> $plos$`10.1371/journal.pone.0092931`
#> $plos$`10.1371/journal.pone.0092931`$doi
#> [1] "10.1371/journal.pone.0092931"
#> 
#> $plos$`10.1371/journal.pone.0092931`$categories
#> [1] "Correction"
#> 
#> 
#> $plos$`10.1371/journal.pcbi.1003594`
#> $plos$`10.1371/journal.pcbi.1003594`$doi
#> [1] "10.1371/journal.pcbi.1003594"
#> 
#> $plos$`10.1371/journal.pcbi.1003594`$categories
#> [1] "Research Article"          "Biology and life sciences"
#> [3] "Computational biology"     "Microbiology"             
#> [5] "Theoretical biology"

tabularize() attempts to help you put the data that comes out of chunks() into a data.frame, which we all know and love.

x %>% chunks(c("doi", "history")) %>% tabularize()
#> $plos
#>                            doi history.received history.accepted
#> 1 10.1371/journal.pone.0059813       2012-09-16       2013-02-19
#> 2 10.1371/journal.pone.0001248       2007-07-02       2007-11-06
#> 3 10.1371/journal.pone.0080763       2013-08-15       2013-10-16
#> 4 10.1371/journal.pone.0102437       2013-11-27       2014-06-19
#> 5 10.1371/journal.pone.0017342       2010-08-24       2011-01-31
#> 6 10.1371/journal.pone.0091497             <NA>             <NA>
#> 7 10.1371/journal.pone.0092931             <NA>             <NA>
#> 8 10.1371/journal.pcbi.1003594       2014-01-09       2014-03-14

Bring it all together

With the pieces above, let's see what it looks like all in one go. Here, we'll search for articles on climate change, then visualize word usage in those articles.

Search

(out <- ft_search(query = 'climate change', from = 'plos', limit = 100))
#> Query:
#>   [climate change] 
#> Found:
#>   [PLoS: 11737; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0] 
#> Returned:
#>   [PLoS: 100; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]

Get full text

(texts <- ft_get(out))
#> [Docs] 99 
#> [Source] R session  
#> [IDs] 10.1371/journal.pone.0054839 10.1371/journal.pone.0045683
#>      10.1371/journal.pone.0050182 10.1371/journal.pone.0118489
#>      10.1371/journal.pone.0053646 10.1371/journal.pone.0015103
#>      10.1371/journal.pone.0008320 10.1371/journal.pmed.1001227
#>      10.1371/journal.pmed.1001374 10.1371/journal.pone.0097480 ...

Because PLOS returns XML, we don't need to do a PDF extraction step. However, if we got full text from arXiv or bioRxiv, we'd need to extract from PDFs first.
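
For instance, reusing the arXiv example from earlier, the extra extraction step would look like:

res_arxiv <- ft_get('cond-mat/9309029', from = "arxiv")  # comes down as a PDF
res_arxiv_txt <- ft_extract(res_arxiv)                   # convert the PDF to plain text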

Pull out chunks

abs <- texts %>% chunks("abstract")

Let's pull out just the text

abs <- lapply(abs$plos, function(z) {
  paste0(z$abstract, collapse = " ")
})

Analyze

Using the tm package, we can analyze our articles

library("tm")
corp <- VCorpus(VectorSource(abs))
# remove stop words, strip whitespace, remove punctuation
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
# Make a term document matrix
tdm <- TermDocumentMatrix(corp)
# remove sparse terms
tdm <- removeSparseTerms(tdm, sparse = 0.8)
# get data
rs <- rowSums(as.matrix(tdm))
df <- data.frame(word = names(rs), n = unname(rs), stringsAsFactors = FALSE)

Visualize

library("ggplot2")
ggplot(df, aes(reorder(word, n), n)) +
  geom_point() +
  coord_flip() +
  labs(y = "Count", x = "Word")

(Plot: word counts across the climate change article abstracts)

rnoaa - Weather data in R

NOAA provides a lot of weather data, across many different websites under different project names. The R package rnoaa accesses many of these sources, including NCDC, GHCND, ISD, the Severe Weather Data Inventory, sea ice, NDBC buoys, tornadoes, HOMR, and IBTrACS storms - each covered below.

rnoaa used to provide access to ERDDAP servers, but a separate package rerddap focuses on just those data sources.

We focus on getting you the data, so there's very little in rnoaa for visualizing, statistics, etc.

Installation

The newest version should be on CRAN in the next few days. In the meantime, let's install from GitHub

devtools::install_github("ropensci/rnoaa")
library("rnoaa")

There's an example below using the lawn, sp, and dplyr packages. If you want to try that, install them like so:

install.packages(c("lawn", "dplyr", "sp"))

NCDC

This web service requires an API key - get one at http://www.ncdc.noaa.gov/cdo-web/token if you don't already have one. NCDC provides access to many different datasets:

Dataset      Description                Start date   End date
ANNUAL       Annual Summaries           1831-02-01   2013-11-01
GHCND        Daily Summaries            1763-01-01   2014-03-15
GHCNDMS      Monthly Summaries          1763-01-01   2014-01-01
NORMAL_ANN   Normals Annual/Seasonal    2010-01-01   2010-01-01
NORMAL_DLY   Normals Daily              2010-01-01   2010-12-31
NORMAL_HLY   Normals Hourly             2010-01-01   2010-12-31
NORMAL_MLY   Normals Monthly            2010-01-01   2010-12-01
PRECIP_15    Precipitation 15 Minute    1970-05-12   2013-03-01
PRECIP_HLY   Precipitation Hourly       1900-01-01   2013-03-01
NEXRAD2      Nexrad Level II            1991-06-05   2014-03-14
NEXRAD3      Nexrad Level III           1994-05-20   2014-03-11
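
Before calling the NCDC functions, make your API key available. A minimal sketch - the NCDC functions take a token argument, and setting options(noaakey = ...) once per session is the shortcut described in rnoaa's docs (check ?ncdc if your version differs):

options(noaakey = "your-api-key-here")  # once per session
# or pass it explicitly in each call, e.g. ncdc(..., token = "your-api-key-here")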

The main function to get data from NCDC is ncdc(). datasetid, startdate, and enddate are required parameters. A quick example, here getting data from the GHCND dataset, from a particular station, and from Oct 1st 2013 to Dec 1st 2013:

ncdc(datasetid = 'GHCND', stationid = 'GHCND:USW00014895', startdate = '2013-10-01',
   enddate = '2013-12-01')
#> $meta
#> $meta$totalCount
#> [1] 697
#> 
#> $meta$pageCount
#> [1] 25
#> 
#> $meta$offset
#> [1] 1
#> 
#> 
#> $data
#> Source: local data frame [25 x 8]
#> 
#>                   date datatype           station value fl_m fl_q fl_so
#> 1  2013-10-01T00:00:00     AWND GHCND:USW00014895    29               W
#> 2  2013-10-01T00:00:00     PRCP GHCND:USW00014895     0               W
#> 3  2013-10-01T00:00:00     SNOW GHCND:USW00014895     0               W
#> 4  2013-10-01T00:00:00     SNWD GHCND:USW00014895     0               W
#> 5  2013-10-01T00:00:00     TAVG GHCND:USW00014895   179    H          S
#> 6  2013-10-01T00:00:00     TMAX GHCND:USW00014895   250               W
#> 7  2013-10-01T00:00:00     TMIN GHCND:USW00014895   133               W
#> 8  2013-10-01T00:00:00     WDF2 GHCND:USW00014895   210               W
#> 9  2013-10-01T00:00:00     WDF5 GHCND:USW00014895   230               W
#> 10 2013-10-01T00:00:00     WSF2 GHCND:USW00014895    76               W
#> ..                 ...      ...               ...   ...  ...  ...   ...
#> Variables not shown: fl_t (chr)
#> 
#> attr(,"class")
#> [1] "ncdc_data"

You probably won't know offhand which station you want data from, so you can first search for stations, in this example using a bounding box that defines a rectangular area near Seattle:

library("lawn")
lawn_bbox_polygon(c(-122.2047, 47.5204, -122.1065, 47.6139)) %>% view

(Map: the bounding box near Seattle, rendered via lawn)

We'll search within that bounding box for weather stations.

ncdc_stations(extent = c(47.5204, -122.2047, 47.6139, -122.1065))
#> $meta
#> $meta$totalCount
#> [1] 9
#> 
#> $meta$pageCount
#> [1] 25
#> 
#> $meta$offset
#> [1] 1
#> 
#> 
#> $data
#> Source: local data frame [9 x 9]
#> 
#>   elevation    mindate    maxdate latitude                         name
#> 1     199.6 2008-06-01 2015-06-29  47.5503      EASTGATE 1.7 SSW, WA US
#> 2     240.8 2010-05-01 2015-07-05  47.5604       EASTGATE 1.1 SW, WA US
#> 3      85.6 2008-07-01 2015-07-05  47.5916        BELLEVUE 0.8 S, WA US
#> 4     104.2 2008-06-01 2015-07-05  47.5211 NEWPORT HILLS 1.9 SSE, WA US
#> 5      58.5 2008-08-01 2009-04-12  47.6138      BELLEVUE 2.3 ENE, WA US
#> 6     199.9 2008-06-01 2009-11-22  47.5465   NEWPORT HILLS 1.4 E, WA US
#> 7      27.1 2008-07-01 2015-07-05  47.6046        BELLEVUE 1.8 W, WA US
#> 8     159.4 2008-11-01 2015-07-05  47.5694      BELLEVUE 2.3 SSE, WA US
#> 9      82.3 2008-12-01 2010-09-17  47.6095       BELLEVUE 0.6 NE, WA US
#> Variables not shown: datacoverage (dbl), id (chr), elevationUnit (chr),
#>   longitude (dbl)
#> 
#> attr(,"class")
#> [1] "ncdc_stations"

And there are 9 stations found. We could then use their station ids (e.g., GHCND:US1WAKG0024) to search for data using ncdc(), or see what kind of data a station has with ncdc_datasets(), or use other functions.
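
For example, a quick sketch (not run here) using the station id mentioned above:

# what datasets does this station have?
ncdc_datasets(stationid = "GHCND:US1WAKG0024")
# pull daily summaries for it over some date range
ncdc(datasetid = "GHCND", stationid = "GHCND:US1WAKG0024",
     startdate = "2013-10-01", enddate = "2013-12-01")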

GHCND

  • GHCND = Global Historical Climatology Network Daily (Data)
  • Data comes from an FTP server

library("dplyr")
dat <- ghcnd(stationid = "AGE00147704")
dat$data %>%
  filter(element == "PRCP", year == 1909)
#>            id year month element VALUE1 MFLAG1 QFLAG1 SFLAG1 VALUE2 MFLAG2
#> 1 AGE00147704 1909    11    PRCP  -9999     NA                -9999     NA
#> 2 AGE00147704 1909    12    PRCP     23     NA             E      0     NA
#>   QFLAG2 SFLAG2 VALUE3 MFLAG3 QFLAG3 SFLAG3 VALUE4 MFLAG4 QFLAG4 SFLAG4
#> 1                -9999     NA                -9999     NA              
#> 2             E      0     NA             E      0     NA             E
#>   VALUE5 MFLAG5 QFLAG5 SFLAG5 VALUE6 MFLAG6 QFLAG6 SFLAG6 VALUE7 MFLAG7
#> 1  -9999     NA                -9999     NA                -9999     NA
#> 2      0     NA             E      0     NA             E      0     NA
#>   QFLAG7 SFLAG7 VALUE8 MFLAG8 QFLAG8 SFLAG8 VALUE9 MFLAG9 QFLAG9 SFLAG9
#> 1     NA         -9999     NA                -9999     NA              
#> 2     NA      E    250     NA             E     75     NA             E
#>   VALUE10 MFLAG10 QFLAG10 SFLAG10 VALUE11 MFLAG11 QFLAG11 SFLAG11 VALUE12
#> 1   -9999      NA                   -9999      NA                   -9999
#> 2     131      NA               E       0      NA               E       0
#>   MFLAG12 QFLAG12 SFLAG12 VALUE13 MFLAG13 QFLAG13 SFLAG13 VALUE14 MFLAG14
#> 1      NA                   -9999      NA                   -9999      NA
#> 2      NA               E       0      NA               E       0      NA
#>   QFLAG14 SFLAG14 VALUE15 MFLAG15 QFLAG15 SFLAG15 VALUE16 MFLAG16 QFLAG16
#> 1                   -9999      NA                   -9999      NA        
#> 2               E       0      NA               E       0      NA        
#>   SFLAG16 VALUE17 MFLAG17 QFLAG17 SFLAG17 VALUE18 MFLAG18 QFLAG18 SFLAG18
#> 1           -9999      NA                   -9999      NA                
#> 2       E       0      NA               E       0      NA               E
#>   VALUE19 MFLAG19 QFLAG19 SFLAG19 VALUE20 MFLAG20 QFLAG20 SFLAG20 VALUE21
#> 1   -9999      NA      NA           -9999      NA      NA           -9999
#> 2       0      NA      NA       E       0      NA      NA       E       0
#>   MFLAG21 QFLAG21 SFLAG21 VALUE22 MFLAG22 QFLAG22 SFLAG22 VALUE23 MFLAG23
#> 1      NA                   -9999      NA                      22      NA
#> 2      NA               E       0      NA               E       0      NA
#>   QFLAG23 SFLAG23 VALUE24 MFLAG24 QFLAG24 SFLAG24 VALUE25 MFLAG25 QFLAG25
#> 1      NA       E       9      NA      NA       E       5      NA      NA
#> 2      NA       E       0      NA      NA       E       0      NA      NA
#>   SFLAG25 VALUE26 MFLAG26 QFLAG26 SFLAG26 VALUE27 MFLAG27 QFLAG27 SFLAG27
#> 1       E       0      NA               E      86      NA      NA       E
#> 2       E       0      NA               E       0      NA      NA       E
#>   VALUE28 MFLAG28 QFLAG28 SFLAG28 VALUE29 MFLAG29 QFLAG29 SFLAG29 VALUE30
#> 1       0      NA      NA       E      28      NA      NA       E       0
#> 2       0      NA      NA       E       0      NA      NA       E       0
#>   MFLAG30 QFLAG30 SFLAG30 VALUE31 MFLAG31 QFLAG31 SFLAG31
#> 1      NA               E   -9999      NA      NA        
#> 2      NA               E      57      NA      NA       E

You can also get data by searching by station id, date min, date max, and variable. E.g.

ghcnd_search("AGE00147704", var = "PRCP")
#> $prcp
#> Source: local data frame [9,803 x 6]
#> 
#>             id  prcp       date mflag qflag sflag
#> 1  AGE00147704 -9999 1909-11-01    NA            
#> 2  AGE00147704    23 1909-12-01    NA           E
#> 3  AGE00147704    81 1910-01-01    NA           E
#> 4  AGE00147704     0 1910-02-01    NA           E
#> 5  AGE00147704    18 1910-03-01    NA           E
#> 6  AGE00147704     0 1910-04-01    NA           E
#> 7  AGE00147704   223 1910-05-01    NA           E
#> 8  AGE00147704     0 1910-06-01    NA           E
#> 9  AGE00147704     0 1910-07-01    NA           E
#> 10 AGE00147704     0 1910-08-01    NA           E
#> ..         ...   ...        ...   ...   ...   ...
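
ghcnd_search() also takes date_min and date_max arguments (as far as I can tell - check ?ghcnd_search) if you want to restrict the date range, e.g.:

ghcnd_search("AGE00147704", var = "PRCP",
             date_min = "1920-01-01", date_max = "1925-12-31")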

ISD

  • ISD = Integrated Surface Database
  • Data comes from an FTP server

You'll likely first want to run isd_stations() to get a list of stations:

stations <- isd_stations()
head(stations)
#>   usaf  wban station_name ctry state icao lat lon elev_m    begin      end
#> 1 7005 99999   CWOS 07005                  NA  NA     NA 20120127 20120127
#> 2 7011 99999   CWOS 07011                  NA  NA     NA 20111025 20121129
#> 3 7018 99999   WXPOD 7018                   0   0   7018 20110309 20130730
#> 4 7025 99999   CWOS 07025                  NA  NA     NA 20120127 20120127
#> 5 7026 99999   WXPOD 7026   AF              0   0   7026 20120713 20141120
#> 6 7034 99999   CWOS 07034                  NA  NA     NA 20121024 20121106
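
Since isd_stations() returns a plain data.frame, you can narrow it down before picking a station - for example, a sketch using dplyr and the columns shown above:

library("dplyr")
stations %>%
  filter(!is.na(lat), !is.na(lon)) %>%
  filter(lat > 47, lat < 48, lon > -123, lon < -122)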

Then get data from particular stations, like

(res <- isd(usaf = "011490", wban = "99999", year = 1986))
#> <ISD Data>
#> Size: 1328 X 85
#> 
#>    total_chars usaf_station wban_station     date time date_flag latitude
#> 1           50        11490        99999 19860101    0         4    66267
#> 2          123        11490        99999 19860101  600         4    66267
#> 3           50        11490        99999 19860101 1200         4    66267
#> 4           94        11490        99999 19860101 1800         4    66267
#> 5           50        11490        99999 19860102    0         4    66267
#> 6          123        11490        99999 19860102  600         4    66267
#> 7           50        11490        99999 19860102 1200         4    66267
#> 8           94        11490        99999 19860102 1800         4    66267
#> 9           50        11490        99999 19860103    0         4    66267
#> 10         123        11490        99999 19860103  600         4    66267
#> ..         ...          ...          ...      ...  ...       ...      ...
#> Variables not shown: longitude (int), type_code (chr), elevation (int),
#>      call_letter (int), quality (chr), wind_direction (int),
#>      wind_direction_quality (int), wind_code (chr), wind_speed (int),
#>      wind_speed_quality (int), ceiling_height (int),
#>      ceiling_height_quality (int), ceiling_height_determination (chr),
#>      ceiling_height_cavok (chr), visibility_distance (int),
#>      visibility_distance_quality (int), visibility_code (chr),
#>      visibility_code_quality (int), temperature (int), temperature_quality
#>      (int), temperature_dewpoint (int), temperature_dewpoint_quality
#>      (int), air_pressure (int), air_pressure_quality (int),
#>      AG1.precipitation (chr), AG1.discrepancy (int), AG1.est_water_depth
#>      (int), GF1.sky_condition (chr), GF1.coverage (int),
#>      GF1.opaque_coverage (int), GF1.coverage_quality (int),
#>      GF1.lowest_cover (int), GF1.lowest_cover_quality (int),
#>      GF1.low_cloud_genus (int), GF1.low_cloud_genus_quality (int),
#>      GF1.lowest_cloud_base_height (int),
#>      GF1.lowest_cloud_base_height_quality (int), GF1.mid_cloud_genus
#>      (int), GF1.mid_cloud_genus_quality (int), GF1.high_cloud_genus (int),
#>      GF1.high_cloud_genus_quality (int), MD1.atmospheric_change (chr),
#>      MD1.tendency (int), MD1.tendency_quality (int), MD1.three_hr (int),
#>      MD1.three_hr_quality (int), MD1.twentyfour_hr (int),
#>      MD1.twentyfour_hr_quality (int), REM.remarks (chr), REM.identifier
#>      (chr), REM.length_quantity (int), REM.comment (chr), KA1.extreme_temp
#>      (chr), KA1.period_quantity (int), KA1.max_min (chr), KA1.temp (int),
#>      KA1.temp_quality (int), AY1.manual_occurrence (chr),
#>      AY1.condition_code (int), AY1.condition_quality (int), AY1.period
#>      (int), AY1.period_quality (int), AY2.manual_occurrence (chr),
#>      AY2.condition_code (int), AY2.condition_quality (int), AY2.period
#>      (int), AY2.period_quality (int), MW1.first_weather_reported (chr),
#>      MW1.condition (int), MW1.condition_quality (int),
#>      EQD.observation_identifier (chr), EQD.observation_text (int),
#>      EQD.reason_code (int), EQD.parameter (chr),
#>      EQD.observation_identifier.1 (chr), EQD.observation_text.1 (int),
#>      EQD.reason_code.1 (int), EQD.parameter.1 (chr)

Severe weather

  • SWDI = Severe Weather Data Inventory
  • From the SWDI site

The Severe Weather Data Inventory (SWDI) is an integrated database of severe weather records for the United States. The records in SWDI come from a variety of sources in the NCDC archive.

The swdi() function allows you to get data in xml, csv, shp, or kmz format. You can get data from many different datasets:

  • nx3tvs NEXRAD Level-3 Tornado Vortex Signatures (point)
  • nx3meso NEXRAD Level-3 Mesocyclone Signatures (point)
  • nx3hail NEXRAD Level-3 Hail Signatures (point)
  • nx3structure NEXRAD Level-3 Storm Cell Structure Information (point)
  • plsr Preliminary Local Storm Reports (point)
  • warn Severe Thunderstorm, Tornado, Flash Flood and Special Marine warnings (polygon)
  • nldn Lightning strikes from Vaisala (.gov and .mil ONLY) (point)

An example: Get all plsr within the bounding box (-91,30,-90,31)

swdi(dataset = 'plsr', startdate = '20060505', enddate = '20060510',
     bbox = c(-91, 30, -90, 31))
#> $meta
#> $meta$totalCount
#> numeric(0)
#> 
#> $meta$totalTimeInSeconds
#> [1] 0
#> 
#> 
#> $data
#> Source: local data frame [5 x 8]
#> 
#>                  ztime     id        event magnitude            city
#> 1 2006-05-09T02:20:00Z 427540         HAIL         1    5 E KENTWOOD
#> 2 2006-05-09T02:40:00Z 427536         HAIL         1    MOUNT HERMAN
#> 3 2006-05-09T02:40:00Z 427537 TSTM WND DMG     -9999    MOUNT HERMAN
#> 4 2006-05-09T03:00:00Z 427199         HAIL         0     FRANKLINTON
#> 5 2006-05-09T03:17:00Z 427200      TORNADO     -9999 5 S FRANKLINTON
#> Variables not shown: county (chr), state (chr), source (chr)
#> 
#> $shape
#>                  shape
#> 1 POINT (-90.43 30.93)
#> 2  POINT (-90.3 30.96)
#> 3  POINT (-90.3 30.96)
#> 4 POINT (-90.14 30.85)
#> 5 POINT (-90.14 30.78)
#> 
#> attr(,"class")
#> [1] "swdi"

Sea ice

The seaice() function simply grabs shapefiles that describe sea ice cover at the North and South poles, and can be useful for examining change through time in sea ice cover, among other things.

An example: Plot sea ice cover for April 1990 for the North pole.

urls <- seaiceeurls(mo = 'Apr', pole = 'N', yr = 1990)
out <- seaice(urls)

library('ggplot2')
ggplot(out, aes(long, lat, group = group)) +
   geom_polygon(fill = "steelblue") +
   theme_ice()

(Plot: sea ice cover for April 1990 at the North pole)

Buoys

  • Get NOAA buoy data from the National Data Buoy Center

Using buoy data requires the ncdf package. Make sure you have that installed, like install.packages("ncdf"). buoy() and buoys() will fail if you don't have ncdf installed.

buoys() - Get available buoys given a dataset name

head(buoys(dataset = 'cwind'))
#>      id
#> 1 41001
#> 2 41002
#> 3 41004
#> 4 41006
#> 5 41008
#> 6 41009
#>                                                                       url
#> 1 http://dods.ndbc.noaa.gov/thredds/catalog/data/cwind/41001/catalog.html
#> 2 http://dods.ndbc.noaa.gov/thredds/catalog/data/cwind/41002/catalog.html
#> 3 http://dods.ndbc.noaa.gov/thredds/catalog/data/cwind/41004/catalog.html
#> 4 http://dods.ndbc.noaa.gov/thredds/catalog/data/cwind/41006/catalog.html
#> 5 http://dods.ndbc.noaa.gov/thredds/catalog/data/cwind/41008/catalog.html
#> 6 http://dods.ndbc.noaa.gov/thredds/catalog/data/cwind/41009/catalog.html

buoy() - Get data for a buoy - if no year or datatype specified, we get the first file

buoy(dataset = 'cwind', buoyid = 46085)
#> Dimensions (rows/cols): [33486 X 5] 
#> 2 variables: [wind_dir, wind_spd] 
#> 
#>                    time latitude longitude wind_dir wind_spd
#> 1  2007-05-05T02:00:00Z   55.855  -142.559      331      2.8
#> 2  2007-05-05T02:10:00Z   55.855  -142.559      328      2.6
#> 3  2007-05-05T02:20:00Z   55.855  -142.559      329      2.2
#> 4  2007-05-05T02:30:00Z   55.855  -142.559      356      2.1
#> 5  2007-05-05T02:40:00Z   55.855  -142.559      360      1.5
#> 6  2007-05-05T02:50:00Z   55.855  -142.559       10      1.9
#> 7  2007-05-05T03:00:00Z   55.855  -142.559       10      2.2
#> 8  2007-05-05T03:10:00Z   55.855  -142.559       14      2.2
#> 9  2007-05-05T03:20:00Z   55.855  -142.559       16      2.1
#> 10 2007-05-05T03:30:00Z   55.855  -142.559       22      1.6
#> ..                  ...      ...       ...      ...      ...

Tornadoes

The function tornadoes() gets tornado data from http://www.spc.noaa.gov/gis/svrgis/.

shp <- tornadoes()
library('sp')
plot(shp)

(Plot: the tornadoes shapefile plotted with sp)

Historical Observing Metadata Repository

homr_definitions() gets you definitions and metadata for datasets

head(homr_definitions())
#> Source: local data frame [6 x 7]
#> 
#>   defType  abbr                fullName    displayName
#> 1     ids GHCND        GHCND IDENTIFIER       GHCND ID
#> 2     ids  COOP             COOP NUMBER        COOP ID
#> 3     ids  WBAN             WBAN NUMBER        WBAN ID
#> 4     ids   FAA FAA LOCATION IDENTIFIER         FAA ID
#> 5     ids  ICAO                 ICAO ID        ICAO ID
#> 6     ids TRANS          TRANSMITTAL ID Transmittal ID
#> Variables not shown: description (chr), cssaName (chr), ghcndName (chr)

homr() gets you metadata for stations given query parameters. In this example, search for data for the state of Delaware

res <- homr(state = 'DE')
names(res) # the stations
#>  [1] "10001871" "10100162" "10100164" "10100166" "20004155" "20004158"
#>  [7] "20004160" "20004162" "20004163" "20004168" "20004171" "20004176"
#> [13] "20004178" "20004179" "20004180" "20004182" "20004184" "20004185"
#> [19] "30001831" "30017384" "30020917" "30021161" "30021998" "30022674"
#> [25] "30026770" "30027455" "30032423" "30032685" "30034222" "30039554"
#> [31] "30043742" "30046662" "30046814" "30051475" "30057217" "30063570"
#> [37] "30064900" "30065901" "30067636" "30069663" "30075067" "30077378"
#> [43] "30077857" "30077923" "30077988" "30079088" "30079240" "30082430"
#> [49] "30084216" "30084262" "30084537" "30084796" "30094582" "30094639"
#> [55] "30094664" "30094670" "30094683" "30094730" "30094806" "30094830"
#> [61] "30094917" "30094931" "30094936" "30094991"

You can index into each one to get more data.
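
For example (a sketch - the exact elements returned depend on the station), pull out one station's metadata by its id:

one <- res[["20004155"]]
names(one)  # see what metadata components are available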

Storms

  • Data from: International Best Track Archive for Climate Stewardship (IBTrACS)
  • Data comes from an FTP server

Flat files (CSVs) are served up as well as shapefiles. In this example, plot storm data for the year 1940:

(res3 <- storm_shp(year = 1940))
#> <NOAA Storm Shp Files>
#> Path: ~/.rnoaa/storms/year/Year.1940.ibtracs_all_points.v03r06.shp
#> Basin: <NA>
#> Storm: <NA>
#> Year: 1940
#> Type: points
res3shp <- storm_shp_read(res3)
sp::plot(res3shp)

(Plot: IBTrACS storm points for 1940)

rerddap - General purpose R client for ERDDAP servers

ERDDAP is a data server that gives you a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. Besides its own RESTful interface, much of which is based on OPeNDAP, ERDDAP can act as an OPeNDAP server and as a WMS server for gridded data.

ERDDAP is a powerful tool - in a world of heterogeneous data, it's often hard to combine data and serve it through the same interface, with tools for querying/filtering/subsetting the data. That is exactly what ERDDAP does. Heterogeneous data sets often have some similarities, such as latitude/longitude data and usually a time component, but other variables vary widely.

NetCDF

rerddap supports the NetCDF format, which is the default when using the griddap() function. We use the ncdf package by default, but you can choose to use ncdf4 instead.

Caching

Downloaded data files are cached in a single hidden directory, ~/.rerddap, on your machine. It's hidden so that you don't accidentally delete the data, but you can still easily delete files, open them, move them around, and so on.
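
For example, you can peek at (or clear out) the cache with base R:

list.files("~/.rerddap", full.names = TRUE)  # see what's been cached
# use unlink() on any of those paths to delete a cached file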

When you use the griddap() or tabledap() functions, we construct an MD5 hash from the base URL and any query parameters - this way each query is cached separately. Once we have the hash, we look in ~/.rerddap for a matching hash. If there's a match we use that file on disk; if there's no match, we make an HTTP request for the data to the ERDDAP server you specify.
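
This isn't rerddap's actual internal code, but a minimal sketch of the idea using the digest package (an assumption - rerddap may build its cache key differently):

library("digest")
url   <- "http://coastwatch.pfeg.noaa.gov/erddap/griddap/noaa_esrl_027d_0fb5_5d38.nc"
query <- list(time = c("2012-01-01", "2012-01-30"), latitude = c(21, 10), longitude = c(-80, -70))
key   <- digest(list(url, query), algo = "md5")      # one hash per unique query
cache_file <- file.path("~/.rerddap", paste0(key, ".nc"))
if (file.exists(cache_file)) {
  # read the cached file from disk
} else {
  # make the HTTP request, then write the result to cache_file
}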

ERDDAP servers

You can get a data.frame of ERDDAP servers using the function servers(). Most, I think, serve some kind of NOAA data, but there are a few that serve non-NOAA data. Here are a few:

head(servers())
#>                                                                                            name
#> 1                                                         Marine Domain Awareness (MDA) - Italy
#> 2                                                                    Marine Institute - Ireland
#> 3                                                      CoastWatch Caribbean/Gulf of Mexico Node
#> 4                                                                    CoastWatch West Coast Node
#> 5                    NOAA IOOS CeNCOOS (Central and Northern California Ocean Observing System)
#> 6 NOAA IOOS NERACOOS (Northeastern Regional Association of Coastal and Ocean Observing Systems)
#>                                        url
#> 1 https://bluehub.jrc.ec.europa.eu/erddap/
#> 2          http://erddap.marine.ie/erddap/
#> 3      http://cwcgom.aoml.noaa.gov/erddap/
#> 4  http://coastwatch.pfeg.noaa.gov/erddap/
#> 5    http://erddap.axiomalaska.com/erddap/
#> 6          http://www.neracoos.org/erddap/

Install

From CRAN

install.packages("rerddap")

Or development version from GitHub

devtools::install_github("ropensci/rerddap")
library('rerddap')

Search

First, you likely want to search for data, specifying whether to search for griddap or tabledap datasets. The default is griddap.

ed_search(query = 'size', which = "table")
#> 11 results, showing first 20 
#>                                                                                         title
#> 1                                                                          CalCOFI Fish Sizes
#> 2                                                                        CalCOFI Larvae Sizes
#> 3                Channel Islands, Kelp Forest Monitoring, Size and Frequency, Natural Habitat
#> 4                                                         CalCOFI Larvae Counts Positive Tows
#> 5                                                                                CalCOFI Tows
#> 7                                                  OBIS - ARGOS Satellite Tracking of Animals
#> 8                                                     GLOBEC NEP MOCNESS Plankton (MOC1) Data
#> 9                                                 GLOBEC NEP Vertical Plankton Tow (VPT) Data
#> 10                            NWFSC Observer Fixed Gear Data, off West Coast of US, 2002-2006
#> 11                                 NWFSC Observer Trawl Data, off West Coast of US, 2002-2006
#> 12 AN EXPERIMENTAL DATASET: Underway Sea Surface Temperature and Salinity Aboard the Oleander
#>             dataset_id
#> 1     erdCalCOFIfshsiz
#> 2     erdCalCOFIlrvsiz
#> 3       erdCinpKfmSFNH
#> 4  erdCalCOFIlrvcntpos
#> 5       erdCalCOFItows
#> 7            aadcArgos
#> 8        erdGlobecMoc1
#> 9         erdGlobecVpt
#> 10  nwioosObsFixed2002
#> 11  nwioosObsTrawl2002
#> 12            nodcPJJU
ed_search(query = 'size', which = "grid")
#> 6 results, showing first 20 
#>                                                                                                   title
#> 6                                                       NOAA Global Coral Bleaching Monitoring Products
#> 13        USGS COAWST Forecast, US East Coast and Gulf of Mexico (Experimental) [time][eta_rho][xi_rho]
#> 14            USGS COAWST Forecast, US East Coast and Gulf of Mexico (Experimental) [time][eta_u][xi_u]
#> 15            USGS COAWST Forecast, US East Coast and Gulf of Mexico (Experimental) [time][eta_v][xi_v]
#> 16 USGS COAWST Forecast, US East Coast and Gulf of Mexico (Experimental) [time][s_rho][eta_rho][xi_rho]
#> 17  USGS COAWST Forecast, US East Coast and Gulf of Mexico (Experimental) [time][Nbed][eta_rho][xi_rho]
#>             dataset_id
#> 6             NOAA_DHW
#> 13 whoi_ed12_89ce_9592
#> 14 whoi_61c3_0b5d_cd61
#> 15 whoi_62d0_9d64_c8ff
#> 16 whoi_7dd7_db97_4bbe
#> 17 whoi_a4fb_2c9c_16a7

This gives back dataset titles and identifiers - with which you should be able to get a sense for which dataset you may want to fetch.

Information

After searching you can get more information on a single dataset

info('whoi_62d0_9d64_c8ff')
#> <ERDDAP info> whoi_62d0_9d64_c8ff 
#>  Dimensions (range):  
#>      time: (2012-06-25T01:00:00Z, 2015-06-24T00:00:00Z) 
#>      eta_v: (0, 334) 
#>      xi_v: (0, 895) 
#>  Variables:  
#>      bedload_Vsand_01: 
#>          Units: kilogram meter-1 s-1 
#>      bedload_Vsand_02: 
#>          Units: kilogram meter-1 s-1 
...

The result is a simple S3 list, but it prints nicely, so it's easy to quickly scan the output and see what you need to proceed. That is, in the next step you'll want to get the dataset, and you'll want to specify your query using some combination of values for latitude, longitude, and time.

griddap (gridded) data

First, get information on a dataset to see time range, lat/long range, and variables.

(out <- info('noaa_esrl_027d_0fb5_5d38'))
#> <ERDDAP info> noaa_esrl_027d_0fb5_5d38 
#>  Dimensions (range):  
#>      time: (1850-01-01T00:00:00Z, 2014-05-01T00:00:00Z) 
#>      latitude: (87.5, -87.5) 
#>      longitude: (-177.5, 177.5) 
#>  Variables:  
#>      air: 
#>          Range: -20.9, 19.5 
#>          Units: degC

Then query for gridded data using the griddap() function

(res <- griddap(out,
  time = c('2012-01-01', '2012-01-30'),
  latitude = c(21, 10),
  longitude = c(-80, -70)
))
#> <ERDDAP griddap> noaa_esrl_027d_0fb5_5d38
#>    Path: [~/.rerddap/648ed11e8b911b65e39eb63c8df339df.nc]
#>    Last updated: [2015-05-09 08:31:10]
#>    File size:    [0 mb]
#>    Dimensions (dims/vars):   [3 X 1]
#>    Dim names: time, latitude, longitude
#>    Variable names: CRUTEM3: Surface Air Temperature Monthly Anomaly
#>    data.frame (rows/columns):   [18 X 4]
#>                    time latitude longitude  air
#> 1  2012-01-01T00:00:00Z     22.5     -77.5   NA
#> 2  2012-01-01T00:00:00Z     22.5     -77.5   NA
#> 3  2012-01-01T00:00:00Z     22.5     -77.5   NA
#> 4  2012-01-01T00:00:00Z     22.5     -77.5 -0.1
#> 5  2012-01-01T00:00:00Z     22.5     -77.5   NA
#> 6  2012-01-01T00:00:00Z     22.5     -77.5 -0.2
#> 7  2012-01-01T00:00:00Z     17.5     -72.5  0.2
#> 8  2012-01-01T00:00:00Z     17.5     -72.5   NA
#> 9  2012-01-01T00:00:00Z     17.5     -72.5  0.3
#> 10 2012-02-01T00:00:00Z     17.5     -72.5   NA
#> ..                  ...      ...       ...  ...

The output of griddap() is a list that you can explore further. Get the summary

res$summary
#> [1] "file ~/.rerddap/648ed11e8b911b65e39eb63c8df339df.nc has 3 dimensions:"
#> [1] "time   Size: 2"
#> [1] "latitude   Size: 3"
#> [1] "longitude   Size: 3"
#> [1] "------------------------"
#> [1] "file ~/.rerddap/648ed11e8b911b65e39eb63c8df339df.nc has 1 variables:"
#> [1] "float air[longitude,latitude,time]  Longname:CRUTEM3: Surface Air Temperature Monthly Anomaly Missval:-9.96920996838687e+36"

Or get the dimension variables (just the names of the variables for brevity here)

names(res$summary$dim)
#> [1] "time"      "latitude"  "longitude"

Get the data.frame (beware: you may want to just look at the head of the data.frame if large)

res$data
#>                    time latitude longitude   air
#> 1  2012-01-01T00:00:00Z     22.5     -77.5    NA
#> 2  2012-01-01T00:00:00Z     22.5     -77.5    NA
#> 3  2012-01-01T00:00:00Z     22.5     -77.5    NA
#> 4  2012-01-01T00:00:00Z     22.5     -77.5 -0.10
#> 5  2012-01-01T00:00:00Z     22.5     -77.5    NA
#> 6  2012-01-01T00:00:00Z     22.5     -77.5 -0.20
#> 7  2012-01-01T00:00:00Z     17.5     -72.5  0.20
#> 8  2012-01-01T00:00:00Z     17.5     -72.5    NA
#> 9  2012-01-01T00:00:00Z     17.5     -72.5  0.30
#> 10 2012-02-01T00:00:00Z     17.5     -72.5    NA
#> 11 2012-02-01T00:00:00Z     17.5     -72.5    NA
#> 12 2012-02-01T00:00:00Z     17.5     -72.5    NA
#> 13 2012-02-01T00:00:00Z     12.5     -67.5  0.40
#> 14 2012-02-01T00:00:00Z     12.5     -67.5    NA
#> 15 2012-02-01T00:00:00Z     12.5     -67.5  0.20
#> 16 2012-02-01T00:00:00Z     12.5     -67.5  0.00
#> 17 2012-02-01T00:00:00Z     12.5     -67.5    NA
#> 18 2012-02-01T00:00:00Z     12.5     -67.5  0.32

You can actually still explore the original netcdf summary object, e.g.,

res$summary$dim$time
#> $name
#> [1] "time"
#> 
#> $len
#> [1] 2
#> 
#> $unlim
#> [1] FALSE
#> 
#> $id
#> [1] 1
#> 
#> $dimvarid
#> [1] 1
#> 
#> $units
#> [1] "seconds since 1970-01-01T00:00:00Z"
#> 
#> $vals
#> [1] 1325376000 1328054400
#> 
#> $create_dimvar
#> [1] TRUE
#> 
#> attr(,"class")
#> [1] "dim.ncdf"

tabledap (tabular) data

tabledap is data that is not gridded by lat/lon/time. In addition, the query interface is a bit different: you can do less than, greater than, and equal to type queries, but they are specified as character strings.

(out <- info('erdCalCOFIfshsiz'))
#> <ERDDAP info> erdCalCOFIfshsiz 
#>  Variables:  
#>      calcofi_species_code: 
#>          Range: 19, 1550 
#>      common_name: 
#>      cruise: 
#>      fish_1000m3: 
#>          Units: Fish per 1,000 cubic meters of water sampled 
#>      fish_count: 
#>      fish_size: 
...
(dat <- tabledap(out, 'time>=2001-07-07', 'time<=2001-07-10', 
                 fields = c('longitude', 'latitude', 'fish_size', 'itis_tsn', 'scientific_name')))
#> <ERDDAP tabledap> erdCalCOFIfshsiz
#>    Path: [~/.rerddap/f013f9ee09bdb4184928d533e575e948.csv]
#>    Last updated: [2015-05-09 08:31:21]
#>    File size:    [0.03 mb]
#>    Dimensions:   [558 X 5]
#> 
#>     longitude  latitude fish_size itis_tsn       scientific_name
#> 2     -118.26    33.255      22.9   623745 Nannobrachium ritteri
#> 3     -118.26    33.255      22.9   623745 Nannobrachium ritteri
#> 4  -118.10667 32.738335      31.5   623625  Lipolagus ochotensis
#> 5  -118.10667 32.738335      48.3   623625  Lipolagus ochotensis
#> 6  -118.10667 32.738335      15.5   162221 Argyropelecus sladeni
#> 7  -118.10667 32.738335      16.3   162221 Argyropelecus sladeni
#> 8  -118.10667 32.738335      17.8   162221 Argyropelecus sladeni
#> 9  -118.10667 32.738335      18.2   162221 Argyropelecus sladeni
#> 10 -118.10667 32.738335      19.2   162221 Argyropelecus sladeni
#> 11 -118.10667 32.738335      20.0   162221 Argyropelecus sladeni
#> ..        ...       ...       ...      ...                   ...

Since both griddap() and tabledap() give back data.frames, it's easy to do downstream manipulation. For example, we can use dplyr to filter, summarize, group, and sort:

library("dplyr")
dat$fish_size <- as.numeric(dat$fish_size)
df <- tbl_df(dat) %>% 
  filter(fish_size > 30) %>% 
  group_by(scientific_name) %>% 
  summarise(mean_size = mean(fish_size)) %>% 
  arrange(desc(mean_size))
df
#> Source: local data frame [20 x 2]
#> 
#>                 scientific_name mean_size
#> 1       Idiacanthus antrostomus 253.00000
#> 2            Stomias atriventer 189.25000
#> 3            Lestidiops ringens  98.70000
#> 4     Tarletonbeania crenularis  56.50000
#> 5      Ceratoscopelus townsendi  53.70000
#> 6     Stenobrachius leucopsarus  47.74538
#> 7               Sardinops sagax  47.00000
#> 8         Nannobrachium ritteri  43.30250
#> 9         Bathylagoides wesethi  43.09167
#> 10         Vinciguerria lucetia  42.00000
#> 11       Cyclothone acclinidens  40.80000
#> 12         Lipolagus ochotensis  39.72500
#> 13        Leuroglossus stilbius  38.35385
#> 14        Triphoturus mexicanus  38.21342
#> 15                Diaphus theta  37.88571
#> 16       Trachipterus altivelis  37.70000
#> 17 Symbolophorus californiensis  37.66000
#> 18         Nannobrachium regale  37.50000
#> 19         Merluccius productus  36.61333
#> 20        Argyropelecus sladeni  32.43333

Then make a cute little plot

library("ggplot2")
ggplot(df, aes(reorder(scientific_name, mean_size), mean_size)) +
  geom_bar(stat = "identity") +
  coord_flip() + 
  theme_grey(base_size = 20) +
  labs(y = "Mean Size", x = "Species")

(Plot: mean fish size by species from the CalCOFI data)