Wednesday, May 16, 2012

Quickly Visualize Your Whole Dataset


The tabplot package makes it easy to visualize an entire data table with a single command. This is useful in exploratory analysis for getting a sense of how your data are structured. Here is a quick example using the diamonds dataset from the ggplot2 package.

# load required packages
library(ggplot2)  # provides the diamonds dataset
library(tabplot)  # provides tableplot()
# import data set
data(diamonds)
# make the plot
tableplot(diamonds)


[Figure: output of tableplot(diamonds)]

The result of tableplot() is a figure with one column per variable. Rows are sorted by one column and grouped into bins; each continuous variable is drawn as a bar chart of the bin means, and each categorical variable as a stacked bar chart showing the relative proportions of its categories within each bin. The default number of bins (nBins = 100) can be changed with the nBins argument, and the column everything is sorted by is set with the sortCol argument (sortCol = 1, the first column, by default).
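As a rough sketch of how these arguments fit together, the call below sorts by price rather than by the first column and uses fewer bins. The select argument and the specific values (five columns, 50 bins) are illustrative choices, not part of the original example.

```r
library(ggplot2)  # provides the diamonds dataset
library(tabplot)  # provides tableplot()

data(diamonds)

# Sort rows by price instead of the first column (carat), use 50 bins
# instead of the default 100, and show only five of the columns.
# The column selection and bin count are illustrative choices.
tableplot(diamonds,
          select  = c(carat, price, cut, color, clarity),
          sortCol = price,
          nBins   = 50)
```

Fewer bins give a coarser but less noisy picture; more bins show finer structure at the cost of a busier plot.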

The default plot is already quite informative here. For instance, it is clear that carat is highly correlated with price and with the x, y, and z dimensions of the diamond. It is also clear that diamonds with the highest clarity ratings (such as VVS2 and VVS1) are much more common at the smaller carat sizes.
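These visual impressions can be checked numerically. The snippet below is a minimal sketch using base R only; the choice of Pearson correlation and of the median as the summary statistic are assumptions made for illustration.

```r
library(ggplot2)  # provides the diamonds dataset
data(diamonds)

# Pearson correlation of carat with price and with the x, y, z dimensions
carat_cor <- sapply(diamonds[c("price", "x", "y", "z")],
                    function(v) cor(diamonds$carat, v))
print(round(carat_cor, 2))

# Median carat within each clarity grade
# (levels are ordered from I1, the worst, to IF, the best)
median_carat <- tapply(diamonds$carat, diamonds$clarity, median)
print(median_carat)
```

If the reading of the plot is right, the correlations should be close to 1 and the median carat should tend to fall as clarity improves.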

Obviously, tableplot() is only a first step, but it is hard to beat for quickly getting a sense of what is happening in a dataset.