Wednesday, May 16, 2012

Quickly Visualize Your Whole Dataset


The tabplot package makes it extremely easy to visualize an entire data table with a single command. This is useful for exploratory analysis, to get a sense of how your data are structured. Here is a quick example with the diamonds dataset from the ggplot2 package.

# load required packages
library(ggplot2)  # provides the diamonds data set
library(tabplot)  # provides tableplot()
# import data set
data(diamonds)
# make the plot
tableplot(diamonds)


[Figure: output of tableplot(diamonds)]

The result of tableplot() is a nice figure: the rows are sorted and grouped into bins, each continuous variable is drawn as a bar chart of the bin means, and each categorical variable as a stacked bar chart showing the relative proportions of its categories within each bin. The default number of bins is 100 and can be changed with the nBins argument, and you can change which column everything is sorted by with the sortCol argument (the default is sortCol = 1, the first column).
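As a small sketch of those two arguments, the calls below redraw the diamonds plot with fewer bins and with a different sort column. The column is given by position, as in the default sortCol = 1; the specific values (50 bins, sorting by price) are only illustrative choices, not recommendations.

```r
library(ggplot2)  # provides the diamonds data set
library(tabplot)  # provides tableplot()
data(diamonds)

# 50 coarser bins instead of the default 100
tableplot(diamonds, nBins = 50)

# sort by price instead of the first column (carat);
# price is the 7th column of diamonds
tableplot(diamonds, sortCol = 7)
```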

The default plot is pretty informative here though. For instance, it is clear that carat is highly correlated with price and with the x, y, z dimensions of the diamond. Also, it is clear that the diamonds with the highest clarity ratings (like VVS2 and VVS1) are much more common at the lower carat sizes.

Obviously, tableplot() is only a first step, but it is hard to beat for quickly getting a sense of what is happening in a dataset.
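One possible second step is to check the visual impressions with numbers. The sketch below uses only base R functions on the same diamonds data; the variable names carat_cor and median_carat are just illustrative.

```r
library(ggplot2)  # provides the diamonds data set
data(diamonds)

# Pearson correlation of carat with price and with the x, y, z dimensions
carat_cor <- cor(diamonds$carat, diamonds[, c("price", "x", "y", "z")])
print(carat_cor)

# median carat within each clarity grade (grades run from I1 up to IF)
median_carat <- tapply(diamonds$carat, diamonds$clarity, median)
print(median_carat)
```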

4 comments:

  1. Pretty! But I think it should be sortCol instead of sortcol.

    1. Nice catch, Kathryn! I fixed the typo.

  2. Why does tabplot *require* sorting? I have time-series data which I'd like to visualize "as-is", in its *current* sort-order.

    To even speak of a sort order for this data is actually a misnomer -- the data has been imported from a csv file, and "sorting" is simply not needed or desired, and has not been done -- there is no single variable or column that I could sort on without disturbing the current *correct* order of these observations.

    Thx,
    -Sean [a rank, rank, RANK R-newbie, btw...]

    1. Hi Sean,

      I am not convinced that tabplot is a good choice for time-series data, because it does barcharts, and you probably want lines. Tabplot is good for when you are looking for correlated variables, and these are easier to see if you sort your data set.

      But if you wanted to try it, you could just create a new column that contains the row indices and sort on that. Let's assume you have successfully read your .csv file into a data.frame called 'myTS'. Then you could simply do

      myTS$sortCol <- 1:nrow(myTS)

      then try tableplot, sorting by the new column.

      Good luck,

      Andrew

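The suggestion in the last reply can be sketched roughly as follows. The myTS data frame here is a made-up stand-in for data read from a .csv file, and the column names value and state are hypothetical. The sketch assumes tableplot()'s decreasing argument, which defaults to TRUE; setting it to FALSE should put the first observation at the top of the plot.

```r
library(tabplot)  # provides tableplot()

# hypothetical stand-in for a time series read from a .csv file
set.seed(1)
myTS <- data.frame(
  value = cumsum(rnorm(500)),
  state = factor(sample(c("low", "high"), 500, replace = TRUE))
)

# record the current row order in a new column
myTS$sortCol <- seq_len(nrow(myTS))

# sort on the index column (the 3rd column here), in ascending order,
# so the rows stay in their original sequence
tableplot(myTS, sortCol = 3, decreasing = FALSE)
```

Note that the rows are still grouped into bins, so each bar shows the mean of several consecutive observations rather than a single time point.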