Wednesday, February 2, 2011

Fixing Up smoothScatter Heat Maps

A while back, I posted an article using the smoothScatter function in R, which builds a color representation of density for scatter plots. When I first found the function, I was extremely excited because it's a very easy and automated way to make a heat map! Unfortunately, the more I messed with the function, the more annoying it became. But that's not to say it doesn't produce very pretty pictures.

I've had a lot of inquiries regarding this function lately, as Harry Pavlidis at THT, Dave Allen at Fangraphs, and Chris Quick at Bay City Ball have implemented it in recent articles elsewhere based on my original code. However, there is a problem with the function: it automatically chooses the range of the data to be plotted.

Now, it absolutely should pick how far out to extrapolate a kernel smoother (it's generally not a good idea to ever go outside the bounds of the data). However, the ability to control the plotting is a bit wonky. In the case of this function, it chooses the axes in a way that is often off-center or not comparable across different data with different ranges. Centered, comparable axes are a key requirement for plotting Pitch F/X data. I've tried using the xlim and ylim options, but this unfortunately makes things worse. If you use these within smoothScatter, it just leaves a bunch of white background beyond where the function chooses to smooth the data. See below for the problems we can run into:

Whitespace:

Off-center:


Chris and others have inquired about this, and I found a few fixes...none of which are great, but I don't think there are any other options.


Option 1:
Create a color palette in which the color representing the lowest density is white.

For this, when you indicate the colors to be used for smoothing in colramp=colorRampPalette(c("col1", "col2", ...)), your first entry should be "white". Choose carefully, as a white background usually works best with a single color or group of similar colors (e.g. Red Only, Blue Only, Red/Orange). This works okay, but I don't think it looks quite as nice as having a darker background. A darker background really makes things 'pop'. Below I have Bruce Froemming "Called Strike" and "Called Ball" pitch density maps by location, using this white background with an all-red palette:
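A minimal sketch of Option 1, assuming simulated points in place of real Pitch F/X data. The names px, pz, and white_red, the particular reds, and the axis limits are illustrative assumptions, not the original code behind the plots.

```r
# Simulated stand-in for pitch locations (px, pz are hypothetical names)
set.seed(1)
px <- rnorm(1000, mean = 0,   sd = 0.8)
pz <- rnorm(1000, mean = 2.5, sd = 0.8)

# Palette whose lowest-density color is white, so the unsmoothed area
# blends into the default white plot background
white_red <- colorRampPalette(c("white", "mistyrose", "red", "darkred"))

# Custom axis limits; nrpoints = 0 suppresses the overplotted outlier points
smoothScatter(px, pz, colramp = white_red, nrpoints = 0,
              xlim = c(-4, 4), ylim = c(-2, 6),
              xlab = "Horizontal location", ylab = "Vertical location")
```

Because the first palette entry is white, the region outside the smoothed image is indistinguishable from zero density.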



Option 2:
Use par(bg="") just before smoothScatter.

This option works as long as you indicate the color for "bg" (means background) to be the first color in your smoothing palette. This way, you can set your axes the way you want, and everything that is white from before will just be filled in essentially as zero density. Unfortunately, this also colors the background beyond your axes and into your plot title and axis labels. This is certainly not optimal, but if you use the right colors it may not turn out too bad. Notice how dark things look even with a Red palette:
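A minimal sketch of Option 2, again on simulated points. The names px, pz, and pal are hypothetical, and the fg, col.axis, and col.lab settings are an added assumption to keep the frame and labels readable on a dark background.

```r
# Simulated stand-in for pitch locations (hypothetical names)
set.seed(1)
px <- rnorm(1000, mean = 0,   sd = 0.8)
pz <- rnorm(1000, mean = 2.5, sd = 0.8)

# Dark-to-bright palette; pal[1] is the zero-density color
pal <- c("black", "darkred", "red", "orange", "yellow")

# Set the device background to the first palette color before plotting.
# par() returns the previous values so they can be restored afterwards.
old_par <- par(bg = pal[1], fg = "white",
               col.axis = "white", col.lab = "white")

smoothScatter(px, pz, colramp = colorRampPalette(pal), nrpoints = 0,
              xlim = c(-4, 4), ylim = c(-2, 6),
              xlab = "Horizontal location", ylab = "Vertical location")

par(old_par)  # restore the previous graphics settings
```

The background color fills the whole device, margins included, which is the drawback described above.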



Option 3:
Use rect() to draw filled rectangles in the ranges where the function does not fill when you custom-set your axes.

This is the most flexible option. Unfortunately, it involves some guess-and-check to be sure you don't overlap your rectangles on top of areas where there is some pitch density. This isn't as easy as it sounds, and sometimes is impossible (especially if there is some density of pitches near the edges of what smoothScatter chose to plot). For this method, we use the "rect()" function and indicate "col=" within the function to tell it what color to fill the rectangle with, as well as "border=" to make the border the same color. See if you can tell where the rectangles begin at the edges of the plot below. In some places you can see evidence of a line that overlapped where I would rather it didn't:
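A rough sketch of Option 3 on simulated points. The inner edges of the rectangles are a guess: the data range plus a hand-tuned value called pad here, which is a hypothetical name and would need adjusting by eye for real data. The names px, pz, pal, inner_x, and inner_z are likewise illustrative.

```r
# Simulated stand-in for pitch locations (hypothetical names)
set.seed(1)
px <- rnorm(1000, mean = 0,   sd = 0.8)
pz <- rnorm(1000, mean = 2.5, sd = 0.8)

pal  <- c("black", "darkred", "red", "orange", "yellow")
fill <- pal[1]  # zero-density color used to fill the gaps

smoothScatter(px, pz, colramp = colorRampPalette(pal), nrpoints = 0,
              bandwidth = 0.25, xlim = c(-4, 4), ylim = c(-2, 6),
              xlab = "Horizontal location", ylab = "Vertical location")

# Guessed edges of the smoothed image: data range plus a hand-tuned pad
pad     <- 0.35  # hypothetical value; adjust by eye
inner_x <- range(px) + c(-pad, pad)
inner_z <- range(pz) + c(-pad, pad)

u <- par("usr")  # actual limits of the plot region: x1, x2, y1, y2

# Four strips around the smoothed image, border matching the fill
rect(u[1], u[3], inner_x[1], u[4], col = fill, border = fill)              # left
rect(inner_x[2], u[3], u[2], u[4], col = fill, border = fill)              # right
rect(inner_x[1], u[3], inner_x[2], inner_z[1], col = fill, border = fill)  # bottom
rect(inner_x[1], inner_z[2], inner_x[2], u[4], col = fill, border = fill)  # top
box()  # redraw the plot frame over the rectangles
```

If pad is too small the rectangles cover real density, and if it is too large a white gap remains, which is the guess-and-check step.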


Finally, an extra suggestion: use the "bandwidth=" option in your smoothScatter plots. I had not bothered with this on my first run with the function, so it used an automatic bandwidth for the "bkde2D" function (from the KernSmooth package) that it calls. For the data I've worked with, 0.20 or 0.25 works relatively well. Of course, what the optimal smoothing really is depends on your data and what you want out of the plot.
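A quick way to see the effect is to draw the default and a fixed bandwidth side by side. This is a sketch on simulated points; px, pz, and bw are hypothetical names, and 0.25 is simply the value mentioned above.

```r
# Simulated stand-in for pitch locations (hypothetical names)
set.seed(1)
px <- rnorm(1000, mean = 0,   sd = 0.8)
pz <- rnorm(1000, mean = 2.5, sd = 0.8)

bw <- 0.25  # fixed bandwidth, in the same units as the data

op <- par(mfrow = c(1, 2))  # two panels side by side
smoothScatter(px, pz, nrpoints = 0, main = "Automatic bandwidth")
smoothScatter(px, pz, nrpoints = 0, bandwidth = bw,
              main = "bandwidth = 0.25")
par(op)  # restore the single-panel layout
```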

That's all I've got for now. I wanted to get this up to help people out a little bit, but I have to get back to my work (they expect me to finish this dissertation at some point, I guess). I really think this function makes some of the best looking heat maps out there. I just wish there was a little more customization possible with it. Good luck!

And for good measure, here is my original color scheme that I really love. Just not sure I like the background of everything to be so dark:



Addition: See the comment section for another suggestion by Dave Armstrong. His solution is far easier. I had tried this before, but ran into problems when I forgot to include "add=T" to the parameters within smoothScatter. There are still distinct edges to the image, though, and I'm going to try and see if I can fix things up within the function myself. (Don't expect too much from me on that part!)

Addition 2: Dave beat me to the punch and fixed up some inner workings of the function. I want to thank him for his help. This is why using R for research and analysis is great: there is a huge support system everywhere! And there is always something new to learn.

8 comments:

  1. Again, thanks a lot. I'm looking forward to getting home later and playing around with R some more. Your posts have been terrific!

  2. You could do the following, where you initialize a plot, make a polygon whose bounds are wider (in both directions) than the limits and whose color is the coolest color in the defined palette (below, it is buylrd[1]). Then, add the parameter "add=T" to the smoothScatter command, which will drop the results of that function on top of the previously initialized plot. The first bit of code below creates some data. The rest follows closely along the post you linked to above (http://princeofslides.blogspot.com/2010/09/heat-map-and-pitch-fx.html)

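    library(MASS)          # mvrnorm()
    library(RColorBrewer)  # brewer.pal()

    # Simulate 1000 bivariate-normal points as stand-in data
    x <- mvrnorm(1000,
                 mu = c(.5, 2.5),
                 Sigma = matrix(c(1, .1, .1, 1), ncol = 2))

    # Reverse the palette so blue (cool) is the low-density end
    buylrd <- rev(brewer.pal(11, "RdYlBu"))

    # Initialize an empty plot with the desired axis limits
    plot(0, 1,
         xlim = c(-4, 4), ylim = c(-2, 6),
         xlab = "x", ylab = "y",
         type = "n")

    # Fill beyond the axis limits with the coolest palette color
    polygon(x = c(-5, 5, 5, -5),
            y = c(-3, -3, 7, 7),
            col = buylrd[1])

    # add = TRUE draws the smoothed image on top of the existing plot
    smoothScatter(x,
                  colramp = colorRampPalette(buylrd),
                  pch = "", nbin = 250, add = TRUE)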

  3. Dave,

    Thanks! I definitely tried something like this with a rectangle using the bounds as the inner portion of the graphic, but stupidly enough I didn't use the "add=T" option, and it just redrew the smoothScatter right over it. Whoops.

    This is a much easier solution. Unfortunately, there are still distinct edges to the image created by smoothScatter as it has in the plots above. It's not a huge deal as there isn't much density out there, but it messes slightly with the aesthetics.

    I'm trying to fiddle with the inner-workings of the function to see if I can fix it up, but thus far have not been able to get it to work.

    Also, thanks for the "rev()" function. I knew there had to be an easier way to reverse the RColorBrewer palette!

  4. I understand that this post is about improving the use of the smoothScatter function, but why is this considered superior to using the kde2d function and then using filled.contour to graph the output?

  5. Not sure I'd say it's certainly superior. I think both have their merits, and I use filled.contour for when I'm not doing a density estimation (i.e. strike zone plots, run values, etc.).

    The one thing I like about this function is the cleanness of the smoothing. I like the aesthetics of the color blending. The advantage to the filled.contour is the automated key and flexibility for plotting different types of functions.

    Also, assuming things would plot to the area needed correctly, this function is a much more direct way to plot, rather than creating your own sequence/matrix of values to plot (i.e. ease of use for those less familiar with R). A while back I had some trouble getting Dave Allen's code for filled.contour to work correctly, so I resorted to this function.

    In the end, you are correct, as the likes of filled.contour and levelplot (in the lattice package) are more flexible.

    BTW, I like the plots and work at your site.

  6. Almost forgot. One advantage with this function over the standard filled.contour is that two plots can be generated in the same window side-by-side. That's not a huge deal, but certainly makes comparisons easier.

  7. Thanks, glad you like the site.

    So I suppose in the end it's just a matter of preference. It certainly is nice to create these plots with just one function.

  8. This type of map should show increasing and decreasing heat with increasing accessibility to other destinations in aggregate.
