pygbif - GBIF client for Python

I maintain an R client for the GBIF API, rgbif. I've been working on it for a few years, and recently I've been thinking that there should be a nice low-level client for Python as well. I didn't see one when searching GitHub, etc., so I recently started working on one: pygbif

It's up on PyPI.

There's not much in pygbif yet - I wanted to get something up early to start getting users, since their feedback will more quickly make the library useful to people.

There are three modules, each with a few methods:

  • species
    • name_backbone()
    • name_suggest()
  • registry
    • nodes()
    • dataset_metrics()
    • datasets()
  • occurrences
    • search()
    • get()
    • get_verbatim()
    • get_fragment()
    • count()
    • count_basisofrecord()
    • count_year()
    • count_datasets()
    • count_countries()
    • count_publishingcountries()
    • count_schema()

Here's a quick intro (in a Jupyter notebook):


pip install pygbif


Registry data

from pygbif import registry
registry.dataset_metrics(uuid = '...')  # pass a GBIF dataset UUID; the UUID used for this output was elided
{u'colCoveragePct': 79,
 u'colMatchingCount': 24335,
 u'countByConstituent': {},
 u'countByIssue': {u'BACKBONE_MATCH_FUZZY': 573, ...},
 u'countByKingdom': {u'ANIMALIA': 30,
  u'FUNGI': 3,
  u'PLANTAE': 10997,
  u'PROTOZOA': 1},
 ...}

Taxonomic names

from pygbif import species
species.name_suggest(q='Puma concolor', limit = 1)
{'data': [{u'canonicalName': u'Puma concolor',
   u'class': u'Mammalia',
   u'classKey': 359,
   u'family': u'Felidae',
   u'familyKey': 9703,
   u'genus': u'Puma',
   u'genusKey': 2435098,
   u'key': 2435099,
   u'kingdom': u'Animalia',
   u'kingdomKey': 1,
   u'nubKey': 2435099,
   u'order': u'Carnivora',
   u'orderKey': 732,
   u'parent': u'Puma',
   u'parentKey': 2435098,
   u'phylum': u'Chordata',
   u'phylumKey': 44,
   u'rank': u'SPECIES',
   u'species': u'Puma concolor',
   u'speciesKey': 2435099}],
 'hierarchy': [{u'1': u'Animalia',
   u'2435098': u'Puma',
   u'359': u'Mammalia',
   u'44': u'Chordata',
   u'732': u'Carnivora',
   u'9703': u'Felidae'}]}

Occurrence data


from pygbif import occurrences
res = occurrences.search(taxonKey = 3329049, limit = 10)
[ x['phylum'] for x in res['results'] ]

Fetch specific occurrences

occurrences.get(key = 252408386)
{u'basisOfRecord': u'OBSERVATION',
 u'catalogNumber': u'70875196',
 u'collectionCode': u'7472',
 u'continent': u'EUROPE',
 u'country': u'United Kingdom',
 u'countryCode': u'GB',
 u'datasetKey': u'26a49731-9457-45b2-9105-1b96063deb26',
 u'day': 30,

Occurrence counts API

occurrences.count(isGeoreferenced = True)


Would love any feedback...

rnoaa - Integrated Surface Database data

I've recently made some improvements to the functions that work with ISD (Integrated Surface Database) data.

isd data

  • The isd() function now caches more intelligently. We now cache using .rds files via saveRDS/readRDS, whereas we used to use .csv files, which take up much more disk space and require worrying about data formats changing when reading data back into an R session. The downside is that you can't open a cached file directly in your favorite spreadsheet viewer, but you can do that manually after reading the data into R. (The caching pattern is sketched just after this list.)
  • In addition, isd() now has a cleanup parameter. If TRUE, we delete the original data file after downloading it from NOAA's FTP server and processing it. That's fine since we keep the cached, processed file, but you can choose not to clean up the original data files.
  • Data processing in isd() is improved as well: we convert key variables to appropriate classes so they're more useful.
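
To make the caching change concrete, here's a minimal sketch of the saveRDS/readRDS pattern described above (an illustration only, not rnoaa's actual internals; the cache path is hypothetical):

# illustration of the .rds caching pattern; rnoaa manages its own cache location
cache_file <- file.path(tempdir(), "isd_011690-99999-1993.rds")  # hypothetical path
if (file.exists(cache_file)) {
  dat <- readRDS(cache_file)  # column classes come back exactly as saved
} else {
  dat <- isd(usaf = "011690", wban = "99999", year = 1993)
  saveRDS(dat, cache_file)    # takes much less disk space than a .csv
}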

isd stations

  • In isd_stations(), there's now a cached version of the station data in the package, or you can optionally get fresh station data from NOAA's FTP server.
  • There's a new function isd_stations_search() that uses the station data to let you search for stations via either:
    • A bounding box
    • A radius from a point


For examples below, you'll need the development version:
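
A minimal sketch, assuming the development version lives at ropensci/rnoaa on GitHub:

devtools::install_github("ropensci/rnoaa")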


Load rnoaa
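
Something like the following (the examples below also use leaflet, lawn, dplyr, lubridate, and ggplot2, which are assumed to be installed):

library("rnoaa")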


ISD stations

Get stations

There's a cached version of the station data in the package, or you can get fresh station data from NOAA's FTP server.

stations <- isd_stations()
head(stations)
#>   usaf  wban station_name ctry state icao lat lon elev_m    begin      end
#> 1 7005 99999   CWOS 07005                  NA  NA     NA 20120127 20120127
#> 2 7011 99999   CWOS 07011                  NA  NA     NA 20111025 20121129
#> 3 7018 99999   WXPOD 7018                   0   0   7018 20110309 20130730
#> 4 7025 99999   CWOS 07025                  NA  NA     NA 20120127 20120127
#> 5 7026 99999   WXPOD 7026   AF              0   0   7026 20120713 20141120
#> 6 7034 99999   CWOS 07034                  NA  NA     NA 20121024 20121106

Filter and visualize stations

In addition to getting the entire station data.frame, you can also search for stations, either with a bounding box or within a radius from a point. First, the bounding box

bbox <- c(-125.0, 38.4, -121.8, 40.9)
out <- isd_stations_search(bbox = bbox)
head(out)
#>     usaf  wban                          station_name ctry state icao
#> 2 724834 99999                        POINT CABRILLO   US    CA     
#> 3 724953 99999                              RIO NIDO   US    CA     
#> 4 724957 23213                 SONOMA COUNTY AIRPORT   US    CA KSTS
#> 5 724957 99999                  C M SCHULZ SONOMA CO   US    CA KSTS
#> 6 724970 99999                  CHICO CALIFORNIA MAP   US    CA  CIC
#>   elev_m    begin      end      lon    lat
#> 1  716.0 20101030 20150831 -122.922 40.747
#> 2   20.0 19810906 19871007 -123.820 39.350
#> 3 -999.0 19891111 19900303 -122.917 38.517
#> 4   34.8 20000101 20150831 -122.810 38.504
#> 5   38.0 19430404 19991231 -122.817 38.517
#> 6   69.0 19420506 19760305 -121.850 39.783

Where is the bounding box? (You'll need lawn, or you can visualize it some other way.)

lawn::lawn_bbox_polygon(bbox) %>% view


Visualize the station subset - yep, looks right

leaflet(data = out) %>%
  addTiles() %>%
  addCircles()  # the lat/lon columns in `out` are picked up automatically

Next, search with a lat/lon coordinate, with a radius. That is, we search for stations within X km from the coordinate.

out <- isd_stations_search(lat = 38.4, lon = -123, radius = 250)
head(out)
#>     usaf  wban             station_name ctry state icao elev_m    begin
#> 1 690070 93217            FRITZSCHE AAF   US    CA KOAR   43.0 19600404
#> 2 720267 23224 AUBURN MUNICIPAL AIRPORT   US    CA KAUN  466.7 20060101
#> 3 720267 99999         AUBURN MUNICIPAL   US    CA KAUN  468.0 20040525
#> 4 720406 99999      GNOSS FIELD AIRPORT   US    CA KDVO    0.6 20071114
#> 5 720576   174       UNIVERSITY AIRPORT   US    CA KEDU   21.0 20130101
#> 6 720576 99999                    DAVIS   US    CA KEDU   21.0 20080721
#>        end      lon    lat
#> 1 19930831 -121.767 36.683
#> 2 20150831 -121.082 38.955
#> 3 20051231 -121.082 38.955
#> 4 20150831 -122.550 38.150
#> 5 20150831 -121.783 38.533
#> 6 20121231 -121.783 38.533

Again, compare search area to stations found

search area

pt <- lawn::lawn_point(c(-123, 38.4))
lawn::lawn_buffer(pt, dist = 250) %>% view


stations found

leaflet(data = out) %>%
  addTiles() %>%
  addCircles()  # lat/lon columns are again picked up automatically

ISD data

Get ISD data

Here, I get data for four stations.

res1 <- isd(usaf="011690", wban="99999", year=1993)
res2 <- isd(usaf="172007", wban="99999", year=2015)
res3 <- isd(usaf="702700", wban="00489", year=2015)
res4 <- isd(usaf="109711", wban="99999", year=1970)

Then, combine the data with rnoaa:::rbind.isd()

res_all <- rbind(res1, res2, res3, res4)

Add date time

# ymd_hm() comes from the lubridate package
res_all$date_time <- ymd_hm(
  sprintf("%s %s", as.character(res_all$date), res_all$time)
)
Remove 999's (NOAA's way to indicate missing/no data)

res_all <- res_all %>% filter(temperature < 900)

Visualize ISD data

ggplot(res_all, aes(date_time, temperature)) +
  geom_line() + 
  facet_wrap(~usaf_station, scales = "free_x")


Metrics for open source projects

Measuring use of open source software isn't always straightforward. The problem is especially acute for software targeted largely at academia, where usage is not measured just by software downloads, but also by citations.

Citations are a well-known pain point because the citation graph is locked behind the closed doors of private providers (e.g., Scopus, Google Scholar). New ventures aim to open up citation data, but of course that's an immense amount of work, so it won't come quickly.

The following is a laundry list of software metrics I'm aware of, some of which I use in our rOpenSci twice-monthly updates.

I primarily develop software for the R language, so some of the metrics are specific to R, but many are not. In addition, we (rOpenSci) don't develop web apps, which may bring in an additional set of metrics not covered below.

I organize by source instead of type of data because some sources give multiple kinds of data - I note what kinds of data they give with labels.

CRAN downloads


  • Link:
  • This is a REST API for CRAN downloads from the RStudio CRAN CDN. Note, however, that the RStudio CDN is only one of many mirrors - users can install packages from other mirrors, and those downloads are not included in this count. Still, a significant portion of downloads probably comes from the RStudio CDN. (A quick sketch of querying this API from R follows this list.)
  • Other programming languages have similar support, e.g., Ruby and Node.
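
As a minimal sketch, the cranlogs R package wraps this download-counts API (assuming cranlogs is installed; rnoaa is used here just as an example package name):

library("cranlogs")
# daily download counts from the RStudio CDN logs over the last month
cran_downloads(packages = "rnoaa", when = "last-month")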


Lagotto

citations github social-media


citations github

  • Link:
  • This is a nascent venture by the ImpactStory team that seeks to uncover the impact of research software. As far as I can tell, they'll collect usage via software downloads and citations in the literature.

Web Site Analytics


  • If you happen to have a website for your project, collecting analytics is a way to gauge views of the landing page and any help/tutorial pages you may have. An easy way to do this is to deploy a basic site on the gh-pages branch of your GitHub repo and use the easily integrated Google Analytics.
  • Whatever analytics you use, in my experience most of the traffic comes from Google searches and from blog posts that mention your project.
  • Google Analytics beacon (for README views): I haven't tried this yet, but it seems promising.

Automated tracking: SSNMP

citations github

  • Link:
  • Scientific Software Network Map Project
  • This is a cool NSF-funded project by Chris Bogart that tracks software usage via GitHub and citations in the literature.

Google Scholar


  • Link:
  • Searching Google Scholar for software citations manually is fine at a small scale, but at a larger scale scraping would be needed; you're not supposed to do that under Google's terms of service, though, and Google will shut you down.
  • You could also try Google Scholar alerts, especially if new citations of your work are infrequent.
  • If you have institutional access to Scopus/Web of Science, you could search those, but I don't push this as an option since it's available to so few.




  • Support forums: Whether you use UserVoice, Discourse, Google Groups, Gitter, etc., depending on your viewpoint, these interactions could be counted as metrics of software usage.
  • Emails: I personally get a lot of emails asking for help with software I maintain. I imagine this is true for most software developers. Counting these could be another metric of software usage, although I never have counted mine.
  • Social media: See Lagotto above, which tracks some social media outlets.
  • Code coverage: There are many options now for code coverage, integrated with each Travis-CI build. A good option is Codecov, which reports percentage test coverage; that single number could be used as one measure of code quality. (A minimal sketch follows this list.)
  • Reviews: There isn't a lot of code review going on that I'm aware of. Even if there were, I suppose this would just be a logical TRUE/FALSE.
  • Cash money y'all: Grants/consulting income/etc. could be counted as a metric.
  • Users: If you require users to create an account or similar before getting your software, you have a sense of number of users and perhaps their demographics.
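
For the code coverage item above, here's a minimal sketch of getting a coverage percentage for an R package with the covr package (assuming covr is installed and you're in the package root; an illustration, not a prescribed setup):

library("covr")
cov <- package_coverage()    # run the package's tests and record line coverage
percent_coverage(cov)        # a single percentage you could track as a metric
# codecov(coverage = cov)    # optionally upload results to Codecov from CI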


Some software metrics efforts on the horizon look interesting:


I'm sure I missed things. Let me know.