Thursday, January 30, 2014

Introduction to dplyr: data manipulation made easy(er) and fun(er)

If you are just getting started in R, checkout my post on good references for beginners



Hadley Wickham has come out with yet another R package that is destined to improve my workflow and let me concentrate less on getting R to do things, and more on my research questions. The package is dplyr, a reboot of an earlier package called plyr.

Behind both packages is the notion that it should be easy to do split-apply-combine operations on your data. These operations are where you group your observations by some categorical variable, do the some operation on each subset, and then recombine results.  The plyr package was already really good at this.

From my perspective, the 2 most important improvements in dplyr are
  1. a MASSIVE increase in speed, making dplyr useful on big data sets

    1. the ability to chain operations together in a natural order
    Here is a quick example of how you can do complicated stuff with dplyr. Example data are from the PanTHERIA1 dataset of life history traits across mammals.

    First, we read in the data from the web.  This step takes the longest of anything we will do, because we are reading a 2.5 MB text file into memory over http. It is a huge dataset with 5416 observations of  55 variables

    Edit: make sure you are using the most recent version of dplyr (0.1.1), or else you may have issues with your R session crashing.

    require(dplyr)
    URL <- "http://esapubs.org/archive/ecol/e090/184/PanTHERIA_1-0_WR05_Aug2008.txt"
    pantheria <- read.table(file=URL,header=TRUE,sep="\t",na.strings=c("-999","-999.00"))

    Next we will set up some factors with human readable levels.  Note that our variables aren't named very concisely, but we will leave them for now.

    pantheria$X1.1_ActivityCycle <-       
               factor(pantheria$X1.1_ActivityCycle)
    levels(pantheria$X1.1_ActivityCycle) <- 
               c("nocturnal","cathermeral","diurnal")

    pantheria$X6.2_TrophicLevel <-
               factor(pantheria$X6.2_TrophicLevel)
    levels(pantheria$X6.2_TrophicLevel) <- 
               c("herbivore","omnivore","carnivore")

    OK.  Now we are ready to show off the magic of dplyr.  We will use the %.%. operator to chain together commands to manipulate our dataframe.  First, we use the mutate()function to create a new column called yearlyOffspring, which is a transformation of two other columns.  Then, we pass that result to the filter function, and filter out just the rodents. Next, we add a group_by() clause, and finally, we use summarise(), to calculate the average body mass for each group. Type ?manip in the command line to see the full list of dplyr manipulation functions.

    Activity_Trophic <-
    pantheria %.% 
        mutate(yearlyOffspring = X16.1_LittersPerYear 
                * X16.1_LittersPerYear) %.%
        filter(MSW05_Order == "Rodentia") %.%
        group_by(X1.1_ActivityCycle,X6.2_TrophicLevel) %.% 
        summarise(meanBM = mean(X5.1_AdultBodyMass_g,na.rm=TRUE), 
                meanYO = mean(yearlyOffspring,na.rm=TRUE))

    This code yields the following, which is exactly what we want!

    ## Source: local data frame [16 x 4] ## Groups: X1.1_ActivityCycle ## ## X1.1_ActivityCycle X6.2_TrophicLevel meanBM meanYO ## 1 nocturnal carnivore 77.89 11.125 ## 2 cathermeral carnivore 227.89 6.250 ## 3 diurnal carnivore 88.34 NaN ## 4 cathermeral omnivore 306.65 5.947 ## 5 diurnal herbivore 1088.98 9.289 ## 6 cathermeral NA 97.97 15.526 ## 7 NA carnivore 51.35 20.250 ## 8 cathermeral herbivore 2625.65 14.154 ## 9 nocturnal herbivore 1172.53 9.793 ## 10 diurnal NA 364.47 21.415 ## 11 nocturnal omnivore 542.24 9.741 ## 12 NA omnivore 392.19 5.079 ## 13 nocturnal NA 205.08 16.200 ## 14 NA herbivore 452.95 6.264 ## 15 diurnal omnivore 300.05 3.438 ## 16 NA NA 220.48 11.508

    The beauty of the %.% operator is that it allows you to do things in the order in which you think about them. You start with your data, then mutate it, then filter it, then group and summarise. You could do the same process with plyr, or with base apply-family functions, but dplyr makes it MUCH cleaner and clearer.  

    Now we can visualize this data, and observe that there is a complex relationship between body mass and reproductive output in rodents!

    require(ggplot2)
    qplot(data = Activity_Trophic, 
    x = log(meanBM), 
    y = meanYO, 
    size = I(5), 
    shape = X1.1_ActivityCycle) + 
    theme_bw(15)




    Please share your experiences with dplyr in the comments section.


    References:
    1Jones KE, Bielby J, Cardillo M, Fritz SA, O’Dell J, Orme CDL, Safi K, Sechrest W, Boakes EH,   Carbone C, et al. 2009. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90:2648.

    Thursday, January 2, 2014

    Discriminant Function Analysis and Phylogenetic Signal

    Rob Scott and I just published a new paper in AJPA on the issues surrounding the use of Discriminant Function Analysis (DFA) in ecomorphology. 

    2013 - Barr, WA and Scott, RS. Phylogenetic comparative methods complement discriminant function analysis in ecomorphology. American Journal of Physical Anthropologydoi:10.1002/ajpa.22462

    Ecomorphology uses anatomical characteristics to predict the ecological context in which an organism lived. This is possible because organisms adapt anatomically to the functional requirements of their lifestyles. However, ecomorphology may be complicated by the fact that both morphological and ecological traits tend to have phylogenetic signal. In other words, closely related species tend to be more similar than more distantly related species. This can make it difficult to tease apart the effects of functional adaptation from those of phylogenetic signal. 

    One of the most common statistical methods in ecomorphology is DFA.  The purpose of our study was to evaluate the performance of DFA in situations with varying levels of phylogenetic signal. 

    We used phylogenetic simulations to create datasets that were related to a phylogenetic tree, but were functionally unrelated to a set of ecological characteristics, which had varying levels of phylogenetic signal.  We simulated data in which (1) both the morphological characters and ecological categories had phylogenetic signal, (2) only the morphological characters had phylogenetic signal, (3) only the ecological category had phylogenetic signal, and (4) when neither the morphology nor the category had phylogenetic signal. 

    Remember: in all cases there was no biomechanical connection between habitat and morphology. We then ran DFA on the resulting datasets. The results are summarized in the figure below. 

    This figure shows the mean success rates of DFA on the vertical axis, and % of DFAs that were significant on the horizontal axis.  When we randomized habitats, DFAs were rarely significant. However, when the actual habitats (with phylogenetic signal) were used, the DFAs are very often statistically significant in cases where the morphological variables have phylogenetic signal.  We used Phylogenetic Generalized Least Squares (PGLS) on these same datasets, and found that PGLS reliably rejects the hypothesis of a biomechanical link between category and morphology.

    Thus, we concluded that PGLS should be used to validate characters before including them in DFA.  
      

    Tuesday, November 19, 2013

    musings on academic publishing

    I am in the process of revising two submitted papers that come from my dissertation research.  Without further ado, I give you my musings on academic publishing.



    Wednesday, November 13, 2013

    faster for() loops in R

    One of the most common ways people write for() loops is to create an empty results vector and then concatenate each result with the previous (and growing) results vector, like the following.  (Note: wrapping an expression in the function system.time() executes the function and returns a summary of how long it took, in seconds.)

    x <- c()
    system.time(
      for(i in 1:40000){
        x<-c(x,i) #here i is combined with previous contents of x
      }
    )

       user  system elapsed 
      2.019   0.082   2.100 

    It is MUCH faster to create the results an empty vector of the correct size, and modify elements in place.  This prevents R from having to move around an ever growing object in memory and is much faster. In short....it seems that what R is slow at is allocating memory for objects.

    x<-numeric(40000) #empty numeric vector
    system.time(
      for(i in 1:40000){
        x[i] <- i #changing value of particular element of x
      }
    )
       user  system elapsed 
      0.066   0.001   0.067 

    The second method is over 31 times faster on my machine.

    PS.  This post was inspired by Hadley Wickham's much more technical and in-depth coverage of memory usage in R.

    Friday, October 18, 2013

    New Dmanisi cranium increases variation, questions early Homo species diversity

    The team from the site of Dmanisi in Georgia just published the description of their "Skull 5", which represents the fifth well preserved skull from this important 1.8 million year old site from the Republic of Georgia.

    Skull 5: credit  Georgia National Museum, AP
    This skull is super important, because it is quite different from the other four Dmanisi skulls in that is has a massive face and jaw, and just a tiny brain at about 550cc. Thus, the authors argue that the Dmanisi sample from a single place and a relatively short time interval encompasses virtually all of the variation found in early Homo specimens from Africa. This would imply that the proposed species diversity in early Homo is not really species diversity at all, unless you want to split the Dmanisi specimen into multiple species.

    The specimen is getting TONS of well-deserved press attention, but much of the coverage is predictably sensationalist....claiming that this skull overturns what we thought we knew about the human family tree.  A more accurate description might be that this skull offers fresh evidence which is lending new support to a long-standing idea in paleoanthropology (i.e. that early Homo is characterized best as one species).

    Saturday, September 7, 2013

    Juvenile ape cranium from Miocene of China

    Credit: Xue-Ping Ji, Yunnan Institute of Cultural Relics and Archaeology

    A team of researchers working at Shuitangba, a site in the Miocene of China, just announced a pretty awesome juvenile cranium of the Miocene ape genus Lufengpithecus.  This skull is important because of its relatively young age (about 6 million years), and because it doesn't closely resemble orangutans. Many researchers have considered Lufengpithecus to be closely related to modern orangutans, but if true, then Lufengpithecus should bear a striking resemblance to modern orangs by 6 million years ago. The authors argue that this isn't the case, and that Lufengpithecus doesn't appear similar to any modern apes.  Very cool fossil!

    Monday, September 2, 2013

    Academic Phylogeny of Physical Anthropology


    Liza Shapiro, Brett Nachman and I have just launched a new website called Academic Phylogeny of Physical Anthropology. We are creating an interactive online genealogy of physical anthropology PhDs.  The idea was hatched over lunch in the department. We were discussing the great paper by Elizabeth Kelley and Robert Sussman in which they trace the academic genealogy of field primatologists. We were lamenting the fact that there wasn't a comparable tree for other sub-specializations within physical anthropology. We decided that tracing academic history is important, and that creating an online version driven by user submissions was the best way to build out the tree.

    Please check out the site.  If you are a physical anthropology PhD (or know somebody who is), make sure their name appears. If it doesn't appear, be sure to add it!

    Tuesday, June 18, 2013

    Early human evolution and the myth of the Paleo Diet

    hominin looking out on savanna

    One of the most visible fad diets lately (at least where I live) is the so-called Paleo Diet.  If you haven't heard of it, you can think of the Paleo Diet as pop evolutionary psychology for foodies. Essentially, the idea is that our bodies are adapted to the diet of our "pre-agricultural, hunter gatherer ancestors", and that we should try to mimic this diet as much as possible, because it is more natural.  

    The basic idea is sensible enough, right? I mean, to the extent that the diet recommends doing away with the heavily processed carbs which are modern technological innovations, I think we can all get on board (leaving aside critical questions regarding poverty and food access for the moment).  However, the picture gets a little less clear when we dig deeper into questions like "which ancestors are we talking about" and "what did those ancestors actually eat"? That's where the science comes in.  

    caveman eating burger and fries

    In the June 3rd issue of PNAS, there were three important papers on the stable isotopic evidence for early hominin diet, especially hominins in East Africa. These are dense technical papers that provide invaluable direct evidence regarding the diets of our early ancestors.  I won't try to summarize these papers, I just want to pick out three broad points that they drive home for me:
    1. Apes eat mostly ripe fruits and leafy greens, but human ancestors shifted their diets away from ape-like diets very early on.
    2. The diets of human ancestors are notably variable when compared with other animals.  
    3. Some foods that our ancestors have probably eaten for millions of years would be forbidden by the so-called Paleodiet. 
    Point 1: While the early hominin species A. anamensis had a carbon isotope signature dominated by C3 resources (Cerling et al, 2013), it is clear that by the time A. afarensis (Lucy's species) came around, they were eating lots of C4 (Wynn et al, 2013).  This is interesting, because it means that a dietary transition away from ape-like diets dominated by C3 vegetation occurred really early in human evolution (around 3.4 Ma) in a species that most researchers agree is directly ancestral to us.  This dietary shift seems really interesting when you consider that even chimpanzees that routinely live in open savanna environments don't eat much C4 vegetation (Schoeninger et al, 1999).  Sponheimer and colleagues (2013) verify that this shift occurred in multiple species through time, and that there is a weak trend towards more C4 as time goes on.  

    Point 2: Human ancestors had remarkably varied diets!  Wynn and colleagues (2013) show that different A. afarensis individuals had isotope values that ranged from the neighborhood of committed browsers nearly (but not quite) up to that of committed grazers. Clearly early hominins were eating lots of different things.  This is consistent with a previous study suggesting that different species of the genus Paranthropus were eating remarkably different diets, even though their jaw anatomy is extremely similar.

    Point 3: Stable isotope analysis can't tell us exactly which C4 plants hominins were eating.  But, a likely candiate C4 food item for early humans are underground storage organs of certain C4 plants. These starchy roots and tubers would be gritty and fibrous, but would include ample carbohydrates and water.  The idea that  early hominins relied on these so-called underground storage organs (USOs) goes way back in paleoanthropology (Hatley and Kappelman, 1980) and had a bit of a revival in the last decade (Laden and Wrangham, 2005).  It is ironic that practitioners of the paleodiet swear off starchy tubers like potatoes when they have likely been an important part of hominin diets for the last 3.4 million years!!!

    Conclusion: It is really hard to know what our earliest ancestors ate. The best science is starting to paint a picture, though and it is clear that leaving behind the ape-like diet of leafy greens and fruits was an integral part of human evolution from very early days.

    If we were to consider recent ancestors (say in the last 50,000 years), it would be clear that hunting and gathering human populations have made a living from every kind of diet you can imagine (think near vegetarians on one extreme and eating tons of whale blubber on the other hand)! It is far from clear what it means to "eat like a caveman" and this idea has much more to do with selling diet books than it does with the science of figuring out what our ancestors actually ate. Thankfully, we have lots of careful scientists doing the difficult task of figuring it out!

    References:


    Hatley T, and Kappelman J. 1980. Bears, pigs, and Plio-Pleistocene hominids:a case for the exploitation of belowground food resources. Human Ecology 8:371–387.

    Laden G, and Wrangham R. 2005. The rise of the hominids as an adaptive shift in fallback foods: Plant underground storage organs (USOs) and australopith origins. Journal of Human Evolution 49:482–498.

    Schoeninger MJ, Moore J, and Sept JM. 1999. Subsistence strategies of two“ savanna” chimpanzee populations: the stable isotope evidence. American Journal of Primatology 49:297–314.


    Wednesday, April 17, 2013

    New edition of Fleagle's "Primate Adaptation and Evolution"

    At last!  There is a beautifully updated version of Fleagle's seminal Primate Adaptation and Evolution.  Just took my first look, and it appears to include all of the many important fossil discoveries since last edition.  Also, there are many new images, and the format is larger, meaning that all the images are much larger.  This is great!!

    Tuesday, April 16, 2013

    Do the same thing to a bunch of variables with lapply()


    It is extremely common to have a dataframe containing a bunch of variables, and to do the exact same thing to all of these variables.

    For instance, lets say we have a dataframe that has a bunch of limb bone measurements of different animals, and we want to see if they are related to a categorical predictor variable after controlling for the body mass of the animal.


    set.seed(500)

    categories <- factor(rep(c("A","B","C"),33))
    BM <- rnorm(99,mean=100,sd=15)

    myData<-data.frame(categories = categories,
                  BM = BM,
                  var1 = 0.05 * BM + as.numeric(categories) + rnorm(99,sd=1),
                  var2 = 0.1 * BM + as.numeric(categories) + rnorm(99,sd=1),
                  var3 = 0.2 * BM + as.numeric(categories) + rnorm(99,sd=1),
                  var4 = 0.4 * BM + as.numeric(categories) + rnorm(99,sd=1),
                  var5 = 0.9 * BM + as.numeric(categories) + rnorm(99,sd=1),
                  var6 = 0.99 * BM + as.numeric(categories) + rnorm(99,sd=1)
                  )

    rm(categories)
    rm(BM)
    head(myData)


    ## categories BM var1 var2 var3 var4 var5 var6 ##1 A 114.53 4.769 12.23 23.87 46.29 104.91 114.57 ##2 B 129.48 9.345 14.21 25.78 55.14 117.26 129.48 ##3 C 113.29 9.604 13.26 25.09 48.38 105.23 116.08 ##4 A 100.46 6.600 12.77 22.17 41.95 93.13 100.06 ##5 B 114.24 7.857 13.86 23.52 46.84 104.18 114.92 ##6 C 91.35 7.289 11.64 22.45 41.04 83.08 92.77


    Plotting all our variables against body mass - the long way


    We have created 6 variables that are all correlated with our 3-level categorical variable.  They also have an increasing correlation with body mass, which we can see in a plot.  Your first inclination might be to set up a plotting space with room for 6 plots, and then type out each plot command, like so.  

    par(mfrow=c(2,3))
    with(myData,plot(var1~BM,main="var1",pch=16,xlab="BodyMass",ylab="var1"))
    with(myData,plot(var2~BM,main="var2",pch=16,xlab="BodyMass",ylab="var2"))
    with(myData,plot(var3~BM,main="var3",pch=16,xlab="BodyMass",ylab="var3"))
    with(myData,plot(var4~BM,main="var4",pch=16,xlab="BodyMass",ylab="var4"))
    with(myData,plot(var5~BM,main="var5",pch=16,xlab="BodyMass",ylab="var5"))
    with(myData,plot(var6~BM,main="var6",pch=16,xlab="BodyMass",ylab="var6"))

    That's not too bad with just 6 variables, but would be annoying with 30 variables. And what if we want to change something about the way we are doing the plot?  We will have to change each one of the plotting commands....which I am way too lazy to do.  Only the variable name changes each time, everything else is exactly the same.  We have a clear case here for replacing our 6 plot commands with a single use of `lapply()`. Note: there are reasons (many of them stylistic) to avoiding explicit `for()` loops in R.  Here here is a good introduction to using the apply family of R functions.  

    click to enlarge