text mining, apis, and parsing api logs

Acquiring full text articles: fulltext is an R package I maintain to obtain full text versions of research articles for text mining. It’s a hard problem, with a spaghetti web of code. One of the hard parts is figuring out the URL for the full text version of an article. Publishers do not have consistent URL patterns through time, so you cannot set rules once and never revisit them. ...

March 21, 2019 · 7 min · Scott Chamberlain

Exceptions in control flow in R

I was listening to episode 189 of the Bike Shed podcast, “It’s Gonna Work, Definitely, No Problems Whatsoever”, and starting at 27:44 there was a conversation about exception handling, specifically exception handling in control flow when making web API requests. This topic piqued my interest straight away, as I do a lot of API work (making and wrapping). The part of the conversation that I want to address is their conclusion that exceptions in control flow are an anti-pattern. This seems to be a common view across programming languages, e.g., this SO thread. On the other hand, there are some languages in which exceptions in control flow are considered normal behavior, e.g., Python (this, this). ...
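To make the contrast concrete, here is a minimal Python sketch of the two styles: checking before acting versus letting an exception drive the control flow. The `response` dictionary and both helper functions are hypothetical stand-ins for a parsed web API response; they are not taken from the podcast or from any package.

```python
def get_field_check_first(record, key, default=None):
    """Check that the key exists before accessing it (no exceptions involved)."""
    if key in record:
        return record[key]
    return default


def get_field_try_first(record, key, default=None):
    """Access the key directly and use the KeyError as the control flow."""
    try:
        return record[key]
    except KeyError:
        return default


# Hypothetical parsed API response, for illustration only
response = {"status": 200, "title": "An article"}

print(get_field_check_first(response, "doi", "missing"))
print(get_field_try_first(response, "doi", "missing"))
```

Both functions return the same values. The difference is only in whether the missing-key case is handled by a test up front or by catching the exception afterwards.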

March 4, 2019 · 9 min · Scott Chamberlain

Notes on porting Ruby to R

In doing a number of ports of Ruby gems to R (vcr, webmockr), I’ve noticed a few differences between the languages that are fun to dive into, at least for me. Monkey patching: Ruby has a nice thing where you can “monkey patch” classes/methods/etc. in other Ruby libraries. For example, let’s say you have Ruby gems foo and bar. If foo has a method hello, you can override the hello method in foo with one from bar. AFAICT this is acceptable in gems on Rubygems.org and in general in the community. ...

February 19, 2019 · 4 min · Scott Chamberlain

trailing commas

Let’s talk about trailing commas (aka: “final commas”, “dangling commas”). A trailing comma is a comma at the end of a series of values in an array or array-like object, leaving an essentially empty slot, e.g., [1, 2, 3, ]. I kind of like them when I work on Ruby and Python projects. A number of advantages of trailing commas have been pointed out, the most common of which is diffs: ...
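As a rough illustration of the diff argument, the Python sketch below appends one value to a multi-line list, with and without a trailing comma, and counts how many lines the diff touches. The `changed_lines` helper and the snippet strings are made up for this example. Note that in Python the trailing comma is simply ignored, so `[1, 2, 3, ]` has three elements.

```python
import difflib


def changed_lines(before, after):
    """Count added and removed lines in a unified diff of two snippets."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


# Without a trailing comma: adding 3 also modifies the line holding 2
no_trailing_before = "values = [\n    1,\n    2\n]"
no_trailing_after = "values = [\n    1,\n    2,\n    3\n]"

# With a trailing comma: adding 3 only adds one new line
trailing_before = "values = [\n    1,\n    2,\n]"
trailing_after = "values = [\n    1,\n    2,\n    3,\n]"

print(changed_lines(no_trailing_before, no_trailing_after))  # 3
print(changed_lines(trailing_before, trailing_after))        # 1
print(len([1, 2, 3, ]))                                      # 3
```

Without the trailing comma, the previous last line has to change too, so the diff shows one removal and two additions. With it, the diff is a single added line.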

February 7, 2019 · 3 min · Scott Chamberlain

condition control: I just want that message once

I’m sure there’s already a way to do this, but here goes. Or maybe this is an anti-pattern. Either way, this is me, asking the stupid question. I ran into this a few hours ago:

    Sys.unsetenv("ENTREZ_KEY")
    library(brranching)
    mynames <- c("Poa annua", "Salix goodingii", "Helianthus annuus")
    phylomatic_names(taxa = mynames, format='rsubmit')

    No ENTREZ API key provided
    Get one via taxize::use_entrez()
    See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
    No ENTREZ API key provided
    Get one via taxize::use_entrez()
    See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
    No ENTREZ API key provided
    Get one via taxize::use_entrez()
    See https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
    [1] "poaceae%2Fpoa%2Fpoa_annua" "salicaceae%2Fsalix%2Fsalix_goodingii" "asteraceae%2Fhelianthus%2Fhelianthus_annuus"

The brranching package uses the taxize package internally, calling its function taxize::tax_name(). The taxize::tax_name() function throws useful messages to the user if their NCBI Entrez API key is not found, and gives them instructions on how to get one. ...

December 6, 2018 · 4 min · Scott Chamberlain

limiting dependencies in R package development

The longer you do anything, the more preferences you may develop for that thing. One of these things is making R packages. One preference I’ve developed is in limiting package dependencies - or at least limiting to the least painful dependencies - in the packages I maintain. Ideally, if something can be done with base R, then do it that way. Everybody has base R packages if they are using R, so you can’t fail there, at least on package installation. ...

October 2, 2018 · 5 min · Scott Chamberlain

Balancing user friendliness and code fragility

I occasionally think about these various topics and ping back and forth between them, thinking I’ve got to make a package more user friendly, then back to thinking oh, I really should make this package easier to maintain, but what if that makes it less user friendly? I’ve wanted to get these thoughts written down for a while now, so here goes. User friendliness and code fragility: It’s an unassailable good to make your code more user friendly. There’s no point in making your package harder to use unless you really don’t want people using it. ...

July 27, 2018 · 5 min · Scott Chamberlain

Exploring specimen collections data in Butte County, California

Why Butte County? I went to college at California State University, Chico - in Butte County, CA. I did a BA degree in Biology there. It was a great program as it was heavily focused on natural history - with classes on herps, birds, insects, fish, etc. Specimen collections data: Specimen collections data are increasingly being digitized, and often accessed via largeish platforms like GBIF and iDigBio. Here I’ll explore Butte County data found with iDigBio with the spocc R package. You could also use the ridigbio package to go directly to iDigBio data. ...

June 12, 2018 · 5 min · Scott Chamberlain

Exploring git commits with git2r

In rOpenSci - as in presumably most open source projects - we want the entire project to be sustainable, but also each individual software project to be sustainable. A big part of each software project (aka R package in this case) being sustainable is the people making it, particularly how many contributors a project has, and how contributions are spread across contributors. There are discussions going on about how to increase contributors to any given project. But the first thing to do is an assessment of where you’re at. One way to do that is visualization. ...

February 5, 2018 · 4 min · Scott Chamberlain

My Sublime Text workflow/setup

Sublime Text is pretty great. Let’s start at the beginning. Why would my primary editing tool not be vim? My background is as a biologist, spending way too many years in grad school. My first programming language was R back in 2006; my first text editor about the same year was Notepad++; my first interaction with the CLI was probably a year later or so (but that was on Windows). After using Notepad++ for a few years, I stumbled upon Sublime Text via advice from a friend. I used it for a few years without paying (which you can still do), and after that realized it was worth paying for. They now have an easy to use Discourse forum too. ...

January 31, 2018 · 3 min · Scott Chamberlain