This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com. He’s blogged at Bad Hessian before here.

If you have a WordPress blog with the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about it, other than that you usually don’t see bar charts with the bars superimposed.

I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).

## R: ggplot2

Although I prefer to use other languages these days for my analytics work, there’s a certain inertia/nostalgia at play: when I think of making charts, I think of ggplot2 and R. Creating the above chart is pretty straightforward, though I didn’t quite replicate it, as I couldn’t figure out how to stop my custom legend from drawing diagonal lines through the keys.

The R Cookbook talks about a hack to remove the diagonal lines from legends, so I don’t feel too bad about not getting it. I also couldn’t figure out how to force ggplot2 to give me the horizontal line at 10000. If anyone in the R community knows how to fix these, let me know!

(Pythonistas: I’m aware of the ggplot port by Yhat; functionality I used in my R code is still in TODO, so I didn’t pursue plotting with ggplot in Python)

## R: Base Graphics

Of course, not everyone finds ggplot2 easy to understand, as it requires a different way of thinking about coding than most ‘base’ R functions. To that end, there are the base graphics built into R, which produced this plot:

While I was able to nearly replicate the WordPress chart (except for making the dark bars slightly narrower than the light ones), the base R syntax is horrid. The abbreviations for plotting arguments are indefensible, the center and width keywords seem to shift the range of the x-axis instead of changing the actual bar width, and in general, plotting with base R was the worst experience of the six libraries I evaluated.

## Python: matplotlib

In the past year or so, there’s been quite a lot of activity towards improving the graphics capabilities in Python. Historically, there’s been a lot of teeth-gnashing about matplotlib being too low-level and hard to work with, but with enough effort, the results are quite pleasant. Unlike with ggplot2 and base R, I was able to replicate all the features of the WordPress plot:
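
The general approach can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the code used for the post: the monthly figures are made up, and the widths and colors are borrowed from values mentioned elsewhere on this page. The idea is simply to draw two bar series at the same x positions, the second one narrower.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly figures, not the data from the post
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
views = [3200, 4100, 5600, 6100, 7400, 8800]
visitors = [1500, 2100, 2900, 3300, 4000, 4700]

fig, ax = plt.subplots(figsize=(8, 3))
positions = list(range(len(months)))

# Fixed y-range with a gridline every 2000, drawn behind the bars
ax.set_ylim(0, 10000)
ax.set_yticks(range(0, 10001, 2000))
ax.yaxis.grid(True, color="lightgray")
ax.set_axisbelow(True)

# Wide light bars first, then narrower dark bars centered on the same positions
wide = ax.bar(positions, views, width=0.85, color="#278DBC", label="Views")
narrow = ax.bar(positions, visitors, width=0.5, color="navy", label="Visitors")

ax.set_xticks(positions)
ax.set_xticklabels(months)
ax.legend(loc="upper right", ncol=2, frameon=False)  # ncol=2 lays the legend out in one row
```

Calling `fig.savefig(...)` with a file name of your choice writes the chart to disk. Because `ax.bar` centers bars on their x positions by default, the narrow bars sit in the middle of the wide ones without any extra offset arithmetic.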

## Python: Seaborn

One of the aforementioned improvements to matplotlib is Seaborn, which promises to be a higher-level means of plotting data than matplotlib, as well as adding new plotting functionality common in statistics and research. Re-creating this plot using Seaborn is a waste of the additional functionality of Seaborn, and as such, I found it more difficult to make this plot using Seaborn than I did with matplotlib.

To replicate the plot, I ended up hacking a solution together using both Seaborn functionality and matplotlib in order to be able to set bar width and to create the legend, which defeats the purpose of using Seaborn in the first place.

## Julia: Gadfly

In the Julia community, Gadfly is clearly the standard for plotting graphics. Supporting d3.js, PNG, PS, and PDF, Gadfly is built to work with many popular back-end environments. I was able to replicate everything about the WordPress graph except for the legend:

While Gadfly took a line or two more code than base R, which needed the fewest lines, I find the Gadfly syntax significantly more pleasant to work with.

## Julia: Plot.ly

Plot.ly is an interesting ‘competitor’ in this challenge, as it’s not a language-specific package per se. Rather, Plot.ly is a means of specifying plots using JSON, with lightweight Julia/Python/MATLAB/R wrappers. I was able to replicate nearly everything about the WordPress plot, with three exceptions: there is no line at 10000, the legend is vertical instead of horizontal, and I couldn’t figure out how to set the bar widths separately.
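
To give a feel for what “specifying plots using JSON” means, here is a rough sketch that builds such a specification as a plain Python dictionary and serializes it. The numbers are made up, and the attribute names (`type`, `marker`, `barmode`, `range`, `dtick`) follow the Plotly figure schema as I understand it today, so treat them as assumptions rather than what the wrappers accepted when this post was written.

```python
import json

# Hypothetical figures, not the data from the post
months = ["Jan", "Feb", "Mar"]
views = [5600, 6100, 7400]
visitors = [2900, 3300, 4000]

figure = {
    "data": [
        {"type": "bar", "name": "Views", "x": months, "y": views,
         "marker": {"color": "#278DBC"}},
        {"type": "bar", "name": "Visitors", "x": months, "y": visitors,
         "marker": {"color": "navy"}},
    ],
    "layout": {
        # "overlay" draws the traces on top of each other instead of side by side
        "barmode": "overlay",
        "yaxis": {"range": [0, 10000], "dtick": 2000},
    },
}

# The language wrappers ultimately send something of this shape to the service
spec = json.dumps(figure, indent=2)
print(spec)
```

The point is that the plot is just data: each language wrapper only has to produce a structure like this one.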

## And The Winner Is…matplotlib?!

If you had told me at the beginning of this exercise that matplotlib (and by extension, Seaborn) would be the only library with which I could replicate all the features of the WordPress graph, I wouldn’t have believed it. And yet, here we are. ggplot2 was certainly very close, and I’m certain that someone knows how to fix the diagonal line issue. I suspect I could submit an issue ticket to Gadfly.jl to get custom legends added (and for that matter, ask Plot.ly for horizontal legends), so in the future these two libraries could reach feature parity as well.

I hope we all agree there’s no hope for Base Graphics in R besides quick throwaway plots.

In the end, the best thing I can say from this exercise is that the analytics community is fortunate to have so many talented people working to provide these amazing visualization libraries. This graph was rather pedestrian in nature, so I didn’t even scratch the surface of what these various libraries can do. Beyond the six libraries I chose, there are others, including: prettyplotlib (Python), Bokeh (Python), Vincent (Python), rCharts (R), ggvis (R), Winston (Julia), ASCII Plots (Julia) and probably even more that I’m not aware of! All free and open-source, and miles apart from the terrible-looking Microsoft graphics in Excel and PowerPoint.

• Michael Corey

“…force ggplot2 to give me the horizontal line at 10000”– ylim(0,10000)
Try dropping the alphas of the colors (esp. navy blue) to get closer to the blending effect in the WordPress version.

• randyzwitch

Thanks Michael. If you add ylim(0,10000), it overwrites the scale_y_continuous(breaks=c(0,2000,4000,6000,8000,10000)) setting I used. So you do get the horizontal line at 10000, but not breaks every 2000 (trying this just now, it defaulted to major gridlines at 2500).

• Sorry about that. Try this:
… +
scale_y_continuous(breaks=c(0,2000,4000,6000,8000,10000), limits=c(0,10000)) +

• randyzwitch

I knew there had to be a simple fix.

• The two issues with ggplot2 can be fixed (https://gist.github.com/durtal/fa0d2bcefecb399faa31), though there is probably an easier, more natural way.

I think the diagonal line in the legend is produced as a result of using “color=” inside “aes()”, and I think color affects the borders of geom_bar, so isn’t really needed. Create a named vector containing the two colors:
cols <- c('Views' = "#278DBC", 'Visitors' = 'navyblue')

The relevant name of each element in "cols" can be used in the "fill=" argument in each "aes()":
geom_bar(data=visits_visitors, aes(x=Month, y=Views, fill='Views'), stat="identity")

Then the named vector can be passed as the values in "scale_fill_manual()":

scale_fill_manual(values=cols)

I had to use "element_blank()" in "legend.title" in theme(), as it takes the name of the first element in the named vector.

For the line at 10,000:
geom_hline(aes(yintercept=10000), color="gray")

• randyzwitch

Thanks for showing a solution for ggplot2 Tom! So it appears in this case, ggplot2 has feature parity to matplotlib (which I would’ve assumed anyway, but didn’t give credit in my post since I couldn’t figure it out 🙂 )

• randyzwitch

Follow-up from Michael Waskom (author of Seaborn library) on how he would approach plotting this chart:

• flutefreak7

I had a need to create overlapping bars but still utilize seaborn’s nice barplot capabilities and control the widths, so I wrote a context manager for shrinking bars’ widths by some percentage. It basically finds the new Rectangles on the current axis and shrinks them upon leaving the with block.

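    from contextlib import contextmanager

    import matplotlib.patches
    import matplotlib.pyplot as plt

    @contextmanager
    def bar_plot_width(percent_width):
        """Shrink bars drawn inside the with block to percent_width of their width."""
        ax = plt.gca()
        before = set(ax.get_children())
        yield
        after = set(ax.get_children())
        # Only touch Rectangles that were added inside the with block
        for c in after - before:
            if isinstance(c, matplotlib.patches.Rectangle):
                (x, y), w = c.xy, c.get_width()
                c.set_width(percent_width * w)
                # Shift right by half the removed width so the bar stays centered
                c.set_x(x + (1 - percent_width) / 2 * w)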

In my case I’m comparing product variations from different lots of material (the dark blue and dark green bars) to the allowable range for each parameter (the light blue).
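
The centering arithmetic in that context manager is easy to check in isolation. The helper below is a hypothetical stand-alone restatement of those two lines (the name `shrink_centered` is mine, not part of matplotlib or seaborn); it shows that shifting the left edge by half of the removed width leaves the bar’s center where it was.

```python
def shrink_centered(x, width, percent_width):
    """Return (new_x, new_width) for a bar shrunk to percent_width of its width.

    x is the bar's left edge; the center is kept fixed.
    """
    new_width = percent_width * width
    # Half of the removed width goes to each side, so the left edge moves right by that amount
    new_x = x + (1 - percent_width) / 2 * width
    return new_x, new_width


# A bar spanning 2.0 to 2.8, shrunk to half its width
x, w = 2.0, 0.8
new_x, new_w = shrink_centered(x, w, 0.5)

center_before = x + w / 2
center_after = new_x + new_w / 2
```

Both centers come out the same (up to floating-point rounding), which is why the shrunken bars stay aligned with the full-width bars drawn underneath them.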

• thelatemail

The base R plot could be shortened and simplified a lot:

bp <- barplot(rep(NA,length(visits_visitors$Views)), ylim = c(0,10000), axes=FALSE)
abline(h=seq(2000,10000, 2000), col='lightgray')
axseq <- seq(5,nrow(bp),5)
axis(1,at=bp[axseq],labels=visits_visitors$Month[axseq],lty=0,col.axis="lightgray")
axis(2,at=axTicks(2),col.axis="lightgray",lty=0,las=2)
barplot(visits_visitors$Views, col = "#278DBC", border = NA, add = TRUE, axes=FALSE)
barplot(visits_visitors$Visitors, col = "navyblue", border = "#00000000", add = TRUE, axes=FALSE)
legend("topright", c("Views", "Visitors"), ncol=2, fill=c("#278DBC","navyblue"), bty="n", border=FALSE)

http://i.imgur.com/2w5APi9.png

• Jacob Westfall

I certainly don’t agree that “there’s no hope for Base Graphics in R besides quick throwaway plots.” At least for the R contenders, I think if anything you have it quite backwards; Hadley himself is on record as saying that ggplot2 is intended as a graphical data exploration tool, and that detailed customization for publication-quality plots is often best done in base R. For the present example, note this plot can be reproduced pretty much exactly in base R, and in a lot less code, by among other things not using barplot(). barplot() is a function that draws a series of rectangles. But series of rectangles can be drawn more easily and with more control using… rect(). Code (with made up data) printed below and plot attached.

views <- runif(30, 2500, 9000)
visitors <- tail(views, 20)*runif(20)
plot(y=c(0,10000), x=c(0,30), cex=0, xaxt="n", xlab="", ylab="", las=1, tick=F, bty="n")
abline(h=seq(0,10000, 2000), col='lightgray')
axis(side=1, at=seq(5,30,5)-.5, tick=F, labels=c("Jun-12","Nov-12","Apr-13","Sep-13","Feb-14","Jul-14"))
legend("topright", fill = c("#278DBC", "navyblue"), bty='n', ncol=2, border=FALSE, legend=c("Views","Visitors"))
rect(ytop=views, ybottom=0, xleft=1:30-.9, xright=1:30-.1, col="#278DBC", border=NA)
rect(ytop=visitors, ybottom=0, xleft=11:30-.75, xright=11:30-.25, col="navyblue", border=NA)

• thelatemail

Nicely done.

Exactly. Base plotting is extremely flexible and quite modular. You can call all-at-once functions like barplot; partially pre-fabbed sub-plot elements like axis and legend; and basic building blocks like lines, segments, rect, points, polygon, text, mtext etc etc if you require fine control.
If you want a plot to be absolutely customisable, it’s hard to go past base functions.

• Barry Rowlingson

With ggplot, when trying to replicate a plot you *will* hit brick walls. The only way through that brick wall is then to write a new geom_ function, or a new stat_ function, or some other thing. As someone who has tried, it’s not easy. Base graphics is not a charting system; it’s essentially a vector and raster drawing system with a few charting routines pre-built for you. As such, it can be made to replicate your plot to pixel-perfect accuracy if that’s what you want.

• Bryan Van de Ven

Just to mention, Bokeh has preliminary MPL support already, using Jake Vanderplas’ mplexporter. At SciPy the MPL team stated their intentions to work on a proper JSON ingest/export serialization format for MPL so that all the new browser toolkits (Bokeh, mpld3, plot.ly) can interoperate even better with MPL. My intention, for instance, is to use MPL as a backend for static image generation for Bokeh (re-using the many man-years of work that has gone into that, instead of reinventing the wheel), in addition to being able to render MPL code easily as Bokeh plot. MPL is fantastic but it is made even better with all the new tools that will be integrating with it.

• Francois Tonneau

A belated remark: If you want easy syntax with quality results, I do not think anything could beat GLE (Graphics Layout Engine). Here is the target figure in GLE, along with the (uncommented) code:

```
size 18 4

set font ss
begin graph
   xaxis min 0.5 max 30.5 ftick 0 dticks 5 nofirst
   yaxis min 0 max 10000 ftick 0 dticks 2000 grid color #ccccee
   side off
   x2axis off
   xticks off
   labels color black dist 0.18 hei 0.35
   data "visits.dat"
   bar d1 width 0.85 color #278dbc fill #278dbc
   bar d2 width 0.50 color #000099 fill #000099
end graph
set hei 0.30 just tl
amove 3.0 3.22
box 0.2 0.2 just tl nobox fill #278dbc
rmove 0.33 0; write "Views"
amove 4.5 3.22
box 0.2 0.2 just tl nobox fill #000099
rmove 0.33 0; write "Visitors"
```

(Because GLE is picky with data files, the format of the data file must be slightly changed: all missing values must be replaced with an asterisk).