Recology

gbids - GenBank IDs API is back up!


Back in March this year I wrote a post about a new API for working with GenBank IDs.

I had to take the API down because it was too expensive to keep up. Expensive because the data dump is very large (3.8 GB compressed), and I need disk space on the server to uncompress it to, I think, about 18 GB, then load it into MySQL, which takes maybe another 30 GB or so. So it’s not expensive because of high traffic - although I wish that were the case - but because of needing lots of disk space.

I was fortunate to recently receive some Amazon Cloud Credits for Research. The credits expire in one year. With these credits, I’ve put the GBIDS API back up. In the next year I’m hoping to gain enough user traction to suggest it’s useful to enough people to keep maintaining - in which case I’ll seek ways to fund it.

But that means I need people to use it! So please give it a try. Let me know what could be better, what could be faster, and what API routes/features/etc. you’d like to see.

Plans

Plans for the future of the GBIDS API:

  • Auto-update the GenBank data. This is quite complicated since the dump is so large. I can either keep an EC2-attached disk large enough to do the dump download/expansion/load/etc., or spin up a new instance each Sunday when they do their data release, do the SQL load, make a dump, shuttle that SQL dump to the running instance, and then load in the new data from it. I haven’t got this bit running yet, so the data is from Aug 7, 2016.
  • Add taxonomic IDs. GenBank also dumps their taxonomic IDs. I think it should be possible to get these into the API, so that users can map accession numbers to taxon IDs and vice versa.
  • Performance: as anyone would want, I want to continually improve performance. I’ll watch out for things I can do, but also let me know what seems too slow.

Try it

Get 5 accession numbers

curl 'https://gbids.xyz/acc?limit=5' | jq .
#> {
#>   "matched": 692006925,
#>   "returned": 5,
#>   "data": [
#>     "A00002",
#>     "A00003",
#>     "X17276",
#>     "X60065",
#>     "CAA42669"
#>   ],
#>   "error": null
#> }

Request to match accession identifiers; some exist, while some do not

curl 'https://gbids.xyz/acc/AACY024124486,AACY024124483,asdfd,asdf,AACY024124476' | jq .
#> {
#>   "matched": 3,
#>   "returned": 5,
#>   "data": {
#>     "AACY024124486": true,
#>     "AACY024124483": true,
#>     "asdfd": false,
#>     "asdf": false,
#>     "AACY024124476": true
#>   },
#>   "error": null
#> }
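The same lookup works from R as well. Here’s a minimal sketch using httr and jsonlite against the /acc route from the examples above - the request code is generic, nothing GBIDS-specific beyond the URL:

library("httr")
library("jsonlite")

# same route as the curl example above
ids <- c("AACY024124486", "AACY024124483", "asdfd", "asdf", "AACY024124476")
res <- GET(paste0("https://gbids.xyz/acc/", paste(ids, collapse = ",")))
stop_for_status(res)
out <- fromJSON(content(res, "text", encoding = "UTF-8"))
out$matched # number of identifiers that matched
out$data    # named list of TRUE/FALSE, one entry per identifier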

There are many more examples in the API docs.

nonoyes - text analysis of Reply All podcast transcripts

Reply All is a great podcast. I’ve been wanting to learn some text analysis tools, and transcripts from the podcast are on their site.

I took some approaches outlined in this vignette from the tidytext package, and used the tokenizers package and some of the tidyverse.

Code is on GitHub at sckott/nonoyes

Also check out the HTML version

Setup

Load deps

library("httr")
library("xml2")
library("stringi")
library("dplyr")
library("ggplot2")
library("tokenizers")
library("tidytext")
library("tidyr")

Source helper functions

source("funs.R")

Set base url

ra_base <- "https://gimletmedia.com/show/reply-all/episodes"

URLs

Make all urls for each page of episodes

urls <- c(ra_base, file.path(ra_base, "page", 2:8))

Get urls for each episode

res <- lapply(urls, get_urls)
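get_urls comes from funs.R, which isn’t shown in this post. A rough sketch of what a helper like it could look like, assuming episode links are plain <a href> tags pointing at /episode/ paths (the selector is my assumption, not the real implementation):

# hypothetical sketch of a get_urls-like helper, not the actual funs.R code
get_urls_sketch <- function(url) {
  html <- xml2::read_html(url)
  hrefs <- xml2::xml_attr(xml2::xml_find_all(html, ".//a"), "href")
  unique(grep("/episode/", hrefs, value = TRUE))
}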

Remove those that are rebroadcasts, updates, or revisited

res <- grep("rebroadcast|update|revisited", unlist(res), value = TRUE, invert = TRUE)

Episode names

Give numbers to some episodes that don’t have them

epnames <- sub("/$", "", sub("https://gimletmedia.com/episode/", "", res))
epnames <- sub("the-anxiety-box", "8-the-anxiety-box", epnames)
epnames <- sub("french-connection", "10-french-connection", epnames)
epnames <- sub("ive-killed-people-and-i-have-hostages", "15-ive-killed-people-and-i-have-hostages", epnames)
epnames <- sub("6-this-proves-everything", "75-this-proves-everything", epnames)
epnames <- sub("zardulu", "56-zardulu", epnames)

Transcripts

Get transcripts

txts <- lapply(res, transcript_fetch, sleep = 1)

Parse transcripts

txtsp <- lapply(txts, transcript_parse)
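transcript_fetch and transcript_parse are also funs.R helpers; the sleep argument above is just politeness between page requests. For the parsing step, here’s a hedged sketch of the general idea, assuming the transcript comes in as “SPEAKER: text” lines (the real function surely differs in the details):

# hypothetical sketch: split "SPEAKER: text" lines into a named list per speaker
parse_sketch <- function(lines) {
  speaker <- stringi::stri_extract_first_regex(lines, "^[A-Z ]+(?=:)")
  text <- sub("^[A-Z ]+:\\s*", "", lines)
  split(text[!is.na(speaker)], speaker[!is.na(speaker)])
}
parse_sketch(c("PJ: Hello!", "ALEX GOLDMAN: Hi.", "PJ: [laughs]"))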

Summary word usage

Summarise data for each transcript

dat <- stats::setNames(lapply(txtsp, function(m) {
  bind_rows(lapply(m, function(v) {
    tmp <- unname(vapply(v, nchar, 1))
    data_frame(
      n = length(tmp),
      mean = mean(tmp),
      n_laugh = count_word(v, "laugh"),
      n_groan = count_word(v, "groan")
    )
  }), .id = "name")
}), epnames)
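count_word is another funs.R helper that isn’t shown. My guess at it - counting how many utterances contain a given word, using the already-loaded stringi - would look something like:

# hypothetical count_word sketch: how many elements of x mention a word
count_word_sketch <- function(x, word) {
  sum(stringi::stri_detect_fixed(tolower(x), word))
}
count_word_sketch(c("[laughs]", "hello", "ALEX laughs"), "laugh") # 2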

Bind data together into a single data frame, then filter and summarise

data <- bind_rows(dat, .id = "episode") %>%
  filter(!is.na(episode)) %>%
  filter(grepl("^PJ$|^ALEX GOLDMAN$", name)) %>%
  mutate(ep_no = as.numeric(strextract(episode, "^[0-9]+"))) %>%
  group_by(ep_no) %>%
  mutate(nrow = NROW(ep_no)) %>%
  ungroup() %>%
  filter(nrow == 2)
data
#> # A tibble: 114 × 8
#>                 episode         name     n      mean n_laugh n_groan ep_no
#>                   <chr>        <chr> <int>     <dbl>   <int>   <int> <dbl>
#> 1            73-sandbox           PJ    89 130.65169       9       0    73
#> 2            73-sandbox ALEX GOLDMAN    25  44.00000       1       1    73
#> 3       72-dead-is-paul           PJ   137  67.77372      17       0    72
#> 4       72-dead-is-paul ALEX GOLDMAN    90  61.82222       8       0    72
#> 5  71-the-picture-taker           PJ    74  77.70270       3       0    71
#> 6  71-the-picture-taker ALEX GOLDMAN    93 105.94624       6       0    71
#> 7        69-disappeared           PJ    72  76.50000       2       0    69
#> 8        69-disappeared ALEX GOLDMAN    50 135.90000       5       0    69
#> 9      68-vampire-rules           PJ   142  88.00704       6       0    68
#> 10     68-vampire-rules ALEX GOLDMAN   117  73.16239      13       0    68
#> # ... with 104 more rows, and 1 more variables: nrow <int>
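The strextract helper (again from funs.R) just pulls the leading episode number out of the episode name; a stringi one-liner would do the same job (my sketch, not the actual helper):

# hypothetical strextract sketch: first regex match, NA if no match
strextract_sketch <- function(x, pattern) {
  stringi::stri_extract_first_regex(x, pattern)
}
as.numeric(strextract_sketch("73-sandbox", "^[0-9]+")) # 73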

Number of words - it seems PJ talks more, but I didn’t do a quantitative comparison

ggplot(data, aes(ep_no, n, colour = name)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_line(aes(group = ep_no), colour = "black") +
  scale_color_discrete(labels = c('Alex', 'PJ'))

Laughs per episode - take home: PJ laughs a lot

ggplot(data, aes(ep_no, n_laugh, colour = name)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_line(aes(group = ep_no), colour = "black") +
  scale_color_discrete(labels = c('Alex', 'PJ'))

Sentiment

Drop episodes with empty transcripts

zero <- which(vapply(txtsp, length, 1) == 0)
txtsp_ <- Filter(function(x) length(x) != 0, txtsp)

Tokenize words, and create data_frame

wordz <- stats::setNames(
  lapply(txtsp_, function(z) {
    bind_rows(
      if (is.null(try_tokenize(z$`ALEX GOLDMAN`))) {
        data_frame()
      } else {
        data_frame(
          name = "Alex",
          word = try_tokenize(z$`ALEX GOLDMAN`)
        )
      },
      if (is.null(try_tokenize(z$PJ))) {
        data_frame()
      } else {
        data_frame(
          name = "PJ",
          word = try_tokenize(z$PJ)
        )
      }
    )
  }), epnames[-zero])
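try_tokenize is a funs.R wrapper so that a speaker with no lines yields NULL instead of an error; something along these lines with the tokenizers package (a sketch - the real version may differ):

# hypothetical try_tokenize sketch: NULL when there's nothing to tokenize
try_tokenize_sketch <- function(x) {
  if (is.null(x) || length(x) == 0) return(NULL)
  unlist(tokenizers::tokenize_words(paste(x, collapse = " ")))
}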

Combine into a single data_frame

(wordz_df <- bind_rows(wordz, .id = "episode"))
#> # A tibble: 104,713 × 3
#>       episode  name      word
#>         <chr> <chr>     <chr>
#> 1  73-sandbox  Alex      alex
#> 2  73-sandbox  Alex   goldman
#> 3  73-sandbox  Alex         i
#> 4  73-sandbox  Alex generally
#> 5  73-sandbox  Alex     don’t
#> 6  73-sandbox  Alex      alex
#> 7  73-sandbox  Alex    really
#> 8  73-sandbox  Alex      alex
#> 9  73-sandbox  Alex    groans
#> 10 73-sandbox  Alex        so
#> # ... with 104,703 more rows

Calculate sentiment using tidytext

bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)
sent <- wordz_df %>%
  inner_join(bing) %>%
  count(name, episode, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ungroup() %>%
  filter(!is.na(episode)) %>%
  complete(episode, name) %>%
  mutate(ep_no = as.numeric(strextract(episode, "^[0-9]+")))
sent
#> # A tibble: 148 × 6
#>                                        episode  name negative positive
#>                                          <chr> <chr>    <dbl>    <dbl>
#> 1  1-an-app-sends-a-stranger-to-say-i-love-you  Alex       19       30
#> 2  1-an-app-sends-a-stranger-to-say-i-love-you    PJ       14       14
#> 3                         10-french-connection  Alex       15       32
#> 4                         10-french-connection    PJ       16       36
#> 5     11-did-errol-morris-brother-invent-email  Alex       NA       NA
#> 6     11-did-errol-morris-brother-invent-email    PJ       25       30
#> 7                           12-backend-trouble  Alex       20       15
#> 8                           12-backend-trouble    PJ       40       59
#> 9                              13-love-is-lies  Alex       NA       NA
#> 10                             13-love-is-lies    PJ       45       64
#> # ... with 138 more rows, and 2 more variables: sentiment <dbl>,
#> #   ep_no <dbl>

Sentiment for each name separately

ggplot(sent, aes(ep_no, sentiment, fill = name)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 2, scales = "free_x")

Compare for each episode

ggplot(sent, aes(ep_no, sentiment, fill = name)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.5), width = 0.6)

Most common positive and negative words

The word like is surely rarely used here as a positive word - meaning, e.g., that they like something - but rather in the colloquial “like, totally” sense, so it’s removed.

Alex

sent_cont_plot(wordz_df, "Alex")

PJ

sent_cont_plot(wordz_df, "PJ")

video editing notes

This is how I edit videos of talks when I need to incorporate slides and video together.

I’m on a Mac

  • import to iMovie (using v10 something)
  • drop movie into editing section
  • split the PDF slides into individual files: pdfseparate foobar.pdf %d.pdf
  • convert each PDF slide into a PNG, looping over the files: for pdf in *.pdf; do sips -s format png --out "${pdf%%.*}.png" "$pdf"; done
  • import the PNGs into iMovie
  • for each image, drop into editing area where you want it
  • when focused on the png of the slide:
    • select crop, then choose fit, and say okay
    • select “add as overlay” (left-most symbol), then choose picture in picture
    • then choose swap
    • then move inset to where you want it
    • say okay
  • rinse and repeat for all slides
  • export - via File option
  • share to youtube

An example of the result