A Data Visualization Book

Note: thanks to Scott for inviting me to contribute to the Recology blog despite being an ecology outsider; my work is primarily in atomic physics. -Pascal

A part of me has always liked thinking about how to effectively present information, but until the past year, I had not read much to support my (idle) interest in information visualization. That changed in the spring when I read Edward Tufte’s The Visual Display of Quantitative Information, a book that stimulated me to think more deeply about presenting information. I originally started with a specific task in mind–a wonderful tool for focusing one’s interests–but quickly found that Tufte’s book was less a practical guide and more a list of general design principles. Then, a few months ago, I stumbled upon Nathan Yau’s blog, FlowingData, and found out he was writing a practical guide to design and visualization. Conveniently enough for me, Yau’s book, Visualize This, would be released within a month of my discovery of his blog; what follows are my impressions of Visualize This.

I like Visualize This a lot. Yau writes with much the same informal tone as on his blog, and the layout is visually pleasing (good thing, too, for a book about visualizing information!). The first few chapters are pretty basic if you have done much data manipulation before, but it is really nice to have something laid out so concisely. The examples are good, too, in that he is very explicit about every step: you never have to guess at a missing step. The author even acknowledges in the introduction that the first part of the book is at an introductory level.

Early in the book, Yau discusses where to obtain data. This compilation of sources is potentially a useful reference for someone, like me, who almost always generates his own data in the lab. Unfortunately, Yau does not talk much about preparation of (or best practices for) your own data. Additionally, from the perspective of a practicing scientist, it would have been nice to hear about how to archive data to make sure it is readable far into the future, but that is probably outside the scope of the book.

Yau is a big proponent of open source software for getting and analyzing data (e.g. Python and R), but he is surprisingly attached to the proprietary Adobe Illustrator for turning figures into presentation-quality graphics. He says that the default options in most analysis programs do not make for very good quality graphics (and he is right), but he does not really acknowledge that you can generate nice output if you go beyond the default settings. For me, the primary advantage of generating output programmatically is that it is easy to regenerate when you need to change the data or the formatting on the plot. Using a graphical user interface, like the one in Adobe Illustrator, is nice if you are only doing something once (how often does that happen?), but when you have to regenerate the darn figure fifty times to satisfy your advisor, it gets tedious to move things around pixel by pixel.

By the time I reached the middle chapters, I found many of the details repetitive. Part of this repetition stems from the fact that Yau divides these chapters by the type of visualization. For example, “Visualizing Proportions” and “Visualizing Relationships” are two of the chapter titles. While I think these distinctions are important ones for telling the right story about one’s data, creating figures for the different data types often boils down to choosing different functions in R or Python. People with less analysis and presentation experience should find the repetition helpful, but I increasingly skimmed these sections as I went along.

Working through Yau’s examples for the steps you do not already know is probably the most useful way of getting something out of the book. So, for example, I started trying to use Python to scrape data from a webpage, something I had not previously done. I followed the book’s data-scraping example just fine, but as with most things in programming, you find all sorts of minor hurdles to clear when you try your own thing. In my case, I am re-learning the Python I briefly learned about 10 years ago (partly in anticipation of not having access to MATLAB licenses once I vacate the academy), since I have forgotten a lot of the syntax. A lot of this stuff would be faster if I were working in MATLAB, which I grew more familiar with in graduate school.
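To give a flavor of what scraping involves, here is a minimal sketch. It is not the book's example: it uses only Python's standard library, and the HTML snippet, the table contents, and the names `TableScraper` and `scrape_rows` are all made up for illustration. The idea is simply to walk through a page's markup and pull the table cells out into rows you can analyze.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
<table id="temps">
  <tr><th>Day</th><th>High (F)</th></tr>
  <tr><td>Mon</td><td>71</td></tr>
  <tr><td>Tue</td><td>68</td></tr>
  <tr><td>Wed</td><td>75</td></tr>
</table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they end up empty
            # and are skipped here.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def scrape_rows(html):
    """Return the data rows of the tables found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    return parser.rows


rows = scrape_rows(SAMPLE_HTML)
highs = {day: int(high) for day, high in rows}
print(highs)
```

On a real page you would first download the HTML (for instance with `urllib.request.urlopen`) and pass the decoded text to `scrape_rows`. That is also where the minor hurdles tend to show up: real markup is rarely as tidy as this sample.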

Overall, Visualize This is a really nice-looking book and will continue to be useful to me as a reference. Yau concludes his book with a refreshing reminder to provide context for the data we present. This advice is particularly relevant when presenting to a wider or lay audience, but it is still important for us, as scientists, to clearly communicate our findings in the literature. Patterns in the data are often not self-evident, and therefore we should think carefully about which visualization tools will best convey the meaning of our results.

Note: Edited to add a link to Visualize This here and in the introductory paragraph.