Thursday, January 30, 2014

Introduction to dplyr: data manipulation made easy(er) and fun(er)

UPDATE 01/13/15: This post is not up to date with the most current version of dplyr.  

Major differences include a new pipe operator (i.e. all instances of %.% below should be %>%). There are likely other differences as well.

If you are just getting started in R, check out my post on good references for beginners.

Hadley Wickham has come out with yet another R package that is destined to improve my workflow and let me concentrate less on getting R to do things, and more on my research questions. The package is dplyr, a reboot of an earlier package called plyr.

Behind both packages is the notion that it should be easy to do split-apply-combine operations on your data. These are operations where you group your observations by some categorical variable, do some operation on each subset, and then recombine the results. The plyr package was already really good at this.
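To make the split-apply-combine idea concrete, here is a minimal sketch in base R. It uses the built-in mtcars data set rather than the data from this post, so the column names (mpg, cyl) are stand-ins chosen only for illustration.

```r
# Split-apply-combine by hand, using the built-in mtcars data set.

# Split: one vector of mpg values for each number of cylinders.
pieces <- split(mtcars$mpg, mtcars$cyl)

# Apply: compute the mean of each piece.
means <- sapply(pieces, mean)

# Combine: assemble the per-group results into a single data frame.
result <- data.frame(cyl = as.numeric(names(means)),
                     mean_mpg = unname(means))
print(result)
```

Packages like plyr and dplyr wrap these three steps in a single, more readable call.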

From my perspective, the two most important improvements in dplyr are:
  1. a MASSIVE increase in speed, making dplyr useful on big data sets
  2. the ability to chain operations together in a natural order

    Here is a quick example of how you can do complicated stuff with dplyr. Example data are from the PanTHERIA [1] dataset of life history traits across mammals.

    First, we read in the data from the web. This step takes the longest of anything we will do, because we are reading a 2.5 MB text file into memory over HTTP. It is a fairly large dataset, with 5416 observations of 55 variables.

    Edit: make sure you are using the most recent version of dplyr (0.1.1), or else you may have issues with your R session crashing.

    URL <- ""
    pantheria <- read.table(file=URL,header=TRUE,sep="\t",na.strings=c("-999","-999.00"))
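The na.strings argument is what turns the dataset's placeholder codes into proper missing values. Here is a small sketch of that behaviour on a made-up, tab-delimited table held in a string; the column names and values are hypothetical and only stand in for the real file.

```r
# A tiny tab-delimited table held in a string, standing in for the real file.
# The column names and values here are made up for illustration.
txt <- "species\tmass_g\tlitters\nmouse\t20.5\t-999\nvole\t-999.00\t3\n"

toy <- read.table(text = txt, header = TRUE, sep = "\t",
                  na.strings = c("-999", "-999.00"))

# Both placeholder codes are read as NA, so the columns stay numeric.
print(toy)
print(colSums(is.na(toy)))
```

Without na.strings, the -999 codes would be read as ordinary numbers and would silently distort any mean computed later.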

    Next we will set up some factors with human readable levels.  Note that our variables aren't named very concisely, but we will leave them for now.

    pantheria$X1.1_ActivityCycle <-       
    levels(pantheria$X1.1_ActivityCycle) <- 

    pantheria$X6.2_TrophicLevel <-
    levels(pantheria$X6.2_TrophicLevel) <- 
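The general pattern here has two steps: convert a coded column to a factor, then assign readable levels. It can be sketched on a toy data frame as follows. The codes and the code-to-label mapping are assumptions made for this example, not values taken from the PanTHERIA documentation.

```r
# Hypothetical integer codes, standing in for a coded column such as
# X1.1_ActivityCycle. The code-to-label mapping below is an assumption
# made for this example only.
toy <- data.frame(activity = c(1, 3, 2, NA, 1))

# Step 1: convert the numeric codes to a factor (levels "1", "2", "3").
toy$activity <- as.factor(toy$activity)

# Step 2: replace the levels with readable labels, in the same order.
levels(toy$activity) <- c("nocturnal", "cathemeral", "diurnal")

print(toy$activity)
```

The labels must be supplied in the same order as the existing levels, so it is worth checking levels() before overwriting them.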

    OK. Now we are ready to show off the magic of dplyr. We will use the %.% operator to chain together commands to manipulate our data frame. First, we use the mutate() function to create a new column called yearlyOffspring, which is a transformation of two other columns. Then, we pass that result to the filter() function and keep just the rodents. Next, we add a group_by() clause, and finally, we use summarise() to calculate the average body mass and average yearly offspring for each group. Type ?manip at the command line to see the full list of dplyr manipulation functions.

    Activity_Trophic <-
    pantheria %.% 
        mutate(yearlyOffspring = X15.1_LitterSize 
                * X16.1_LittersPerYear) %.%
        filter(MSW05_Order == "Rodentia") %.%
        group_by(X1.1_ActivityCycle,X6.2_TrophicLevel) %.% 
        summarise(meanBM = mean(X5.1_AdultBodyMass_g,na.rm=TRUE), 
                meanYO = mean(yearlyOffspring,na.rm=TRUE))

    This code yields the following, which is exactly what we want!

    ## Source: local data frame [16 x 4]
    ## Groups: X1.1_ActivityCycle
    ##
    ##    X1.1_ActivityCycle X6.2_TrophicLevel  meanBM meanYO
    ##  1          nocturnal         carnivore   77.89 11.125
    ##  2        cathermeral         carnivore  227.89  6.250
    ##  3            diurnal         carnivore   88.34    NaN
    ##  4        cathermeral          omnivore  306.65  5.947
    ##  5            diurnal         herbivore 1088.98  9.289
    ##  6        cathermeral                NA   97.97 15.526
    ##  7                 NA         carnivore   51.35 20.250
    ##  8        cathermeral         herbivore 2625.65 14.154
    ##  9          nocturnal         herbivore 1172.53  9.793
    ## 10            diurnal                NA  364.47 21.415
    ## 11          nocturnal          omnivore  542.24  9.741
    ## 12                 NA          omnivore  392.19  5.079
    ## 13          nocturnal                NA  205.08 16.200
    ## 14                 NA         herbivore  452.95  6.264
    ## 15            diurnal          omnivore  300.05  3.438
    ## 16                 NA                NA  220.48 11.508

    The beauty of the %.% operator is that it allows you to do things in the order in which you think about them. You start with your data, then mutate it, then filter it, then group and summarise. You could do the same process with plyr, or with base apply-family functions, but dplyr makes it MUCH cleaner and clearer.  
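For readers on a current version of dplyr, the same mutate, filter, group, summarise chain can be sketched with the newer %>% pipe. This example uses the built-in mtcars data set instead of PanTHERIA, so the columns (wt, am, cyl, gear, mpg) and the derived wt_kg column are stand-ins chosen for illustration. The nested version is included only to show what the pipe saves you from writing.

```r
library(dplyr)

# Chained form: read top to bottom, in the order the steps happen.
chained <- mtcars %>%
  mutate(wt_kg = wt * 1000 * 0.4536) %>%   # wt is recorded in 1000s of lbs
  filter(am == 1) %>%                      # keep manual-transmission cars
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE),
            mean_wt_kg = mean(wt_kg, na.rm = TRUE))

# Nested form: the same steps, but read from the inside out.
nested <- summarise(
  group_by(filter(mutate(mtcars, wt_kg = wt * 1000 * 0.4536), am == 1),
           cyl, gear),
  mean_mpg = mean(mpg, na.rm = TRUE),
  mean_wt_kg = mean(wt_kg, na.rm = TRUE)
)

print(chained)
```

Both forms should produce the same table; the chained one simply matches the order in which you would describe the analysis out loud.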

    Now we can visualize this data, and observe that there is a complex relationship between body mass and reproductive output in rodents!

    qplot(data = Activity_Trophic, 
    x = log(meanBM), 
    y = meanYO, 
    size = I(5), 
    shape = X1.1_ActivityCycle) + 

    Please share your experiences with dplyr in the comments section.

    [1] Jones KE, Bielby J, Cardillo M, Fritz SA, O’Dell J, Orme CDL, Safi K, Sechrest W, Boakes EH, Carbone C, et al. 2009. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90:2648.