Sunday, September 28, 2014

The Evolution of Evolutionary Thought

I just found this document on my computer.  It is a short synopsis of the history of evolutionary thought that I wrote for a class in my first year of grad school.  It is not perfect or complete, but I thought it was worth posting.  Please add corrections or comments below.

The theory of evolution by Darwinian natural selection is of critical importance to modern paleontology, and the field of paleontology has had a decisive role in the justification of the theory.  Early ideas on the natural community of life followed Plato’s Principle of Plenitude.  This principle essentially stated that everything that could exist did in fact exist, and that new forms could not be created nor could existing forms be destroyed.  This principle was at the heart of the idea of the scala naturae, or the Great Chain of Being, which envisioned the natural world as a hierarchical ordering of organisms from the simplest to the most complex (read: most perfect), from the creeping slimes to the crowning achievement of creation, humans.  Humans were merely a step below cherubim and the other categories of angels.

Scala Naturae

Thursday, January 30, 2014

Introduction to dplyr: data manipulation made easy(er) and fun(er)

UPDATE 01/13/15: This post is not up to date with the most current version of dplyr.  

Major differences include a new pipe operator (i.e. all instances of %.% below should be %>%). There are likely other differences as well.

If you are just getting started in R, check out my post on good references for beginners.

Hadley Wickham has come out with yet another R package that is destined to improve my workflow and let me concentrate less on getting R to do things, and more on my research questions. The package is dplyr, a reboot of an earlier package called plyr.

Behind both packages is the notion that it should be easy to do split-apply-combine operations on your data. These operations are where you group your observations by some categorical variable, do the same operation on each subset, and then recombine the results.  The plyr package was already really good at this.
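To make split-apply-combine concrete, here is a minimal sketch in base R, with no plyr or dplyr needed. The toy data frame and its column names are made up for illustration.

```r
# Toy data: body mass (g) for a few animals in two made-up groups
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  mass  = c(10, 20, 5, 15, 40)
)

# Split: one vector of masses per group
pieces <- split(toy$mass, toy$group)

# Apply: run the same operation (here, the mean) on each piece
means <- sapply(pieces, mean)

# Combine: gather the per-group results back into one data frame
result <- data.frame(group = names(means), meanMass = unname(means))
print(result)
```

plyr and dplyr wrap these three steps into a single, more readable call, which is the whole appeal.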

From my perspective, the 2 most important improvements in dplyr are
  1. a MASSIVE increase in speed, making dplyr useful on big data sets
  2. the ability to chain operations together in a natural order

    Here is a quick example of how you can do complicated stuff with dplyr. Example data are from the PanTHERIA dataset [1] of life history traits across mammals.

    First, we read in the data from the web.  This step takes the longest of anything we will do, because we are reading a 2.5 MB text file into memory over HTTP. It is a huge dataset, with 5416 observations of 55 variables.

    Edit: make sure you are using the most recent version of dplyr (0.1.1), or else you may have issues with your R session crashing.

    URL <- ""
    pantheria <- read.table(file=URL,header=TRUE,sep="\t",na.strings=c("-999","-999.00"))
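The na.strings argument is doing real work in that call: the file marks missing values with -999, and left alone those would be read as genuine numbers and wreck any mean. Here is a small sketch of the effect, using a made-up three-row, tab-separated snippet in place of the real file (the column names are hypothetical).

```r
# Hypothetical snippet standing in for the real tab-separated file
txt <- "Species\tMass_g\tLitterSize\nmouse\t20.5\t-999\nrat\t-999.00\t8\nvole\t30\t5"

# Same arguments as the real call, but reading from a string instead of a URL
demo <- read.table(text = txt, header = TRUE, sep = "\t",
                   na.strings = c("-999", "-999.00"))

print(demo)

# Count the missing values in each column
print(colSums(is.na(demo)))
```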

    Next we will set up some factors with human readable levels.  Note that our variables aren't named very concisely, but we will leave them for now.

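    # Assumption: PanTHERIA codes 1, 2, 3 correspond to these labels in this order
    pantheria$X1.1_ActivityCycle <- factor(pantheria$X1.1_ActivityCycle)
    levels(pantheria$X1.1_ActivityCycle) <- c("nocturnal", "cathermeral", "diurnal")

    # Assumption: PanTHERIA codes 1, 2, 3 correspond to these labels in this order
    pantheria$X6.2_TrophicLevel <- factor(pantheria$X6.2_TrophicLevel)
    levels(pantheria$X6.2_TrophicLevel) <- c("herbivore", "omnivore", "carnivore")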
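Assigning to levels() relies on R sorting the numeric codes in the order you expect. A slightly more defensive pattern is to spell out the code-to-label mapping in a single factor() call. This is only a sketch on a made-up vector of codes, and the mapping shown (1 = nocturnal, 2 = cathemeral, 3 = diurnal) is an assumption that should be checked against the PanTHERIA metadata.

```r
# Made-up activity cycle codes, including a missing value
codes <- c(1, 3, 2, NA, 1)

# Explicit mapping: levels are the raw codes, labels are the readable names
# (assumed coding, check it against the dataset's metadata)
activity <- factor(codes,
                   levels = c(1, 2, 3),
                   labels = c("nocturnal", "cathemeral", "diurnal"))

print(activity)
print(table(activity, useNA = "ifany"))
```

With this form, a code that is absent from the data still gets the right label, because nothing depends on which values happen to be present.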

    OK.  Now we are ready to show off the magic of dplyr.  We will use the %.% operator to chain together commands to manipulate our dataframe.  First, we use the mutate() function to create a new column called yearlyOffspring, which is the product of two other columns.  Then, we pass that result to the filter() function, and keep just the rodents. Next, we add a group_by() clause, and finally, we use summarise() to calculate the mean body mass and mean yearly offspring for each group. Type ?manip at the R prompt to see the full list of dplyr manipulation functions.

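    Activity_Trophic <-
    pantheria %.% 
        # yearly offspring = litter size x litters per year
        mutate(yearlyOffspring = X15.1_LitterSize 
                * X16.1_LittersPerYear) %.%
        # keep only the rodents
        filter(MSW05_Order == "Rodentia") %.%
        # one group per activity cycle / trophic level combination
        group_by(X1.1_ActivityCycle,X6.2_TrophicLevel) %.% 
        # per-group means, ignoring missing values
        summarise(meanBM = mean(X5.1_AdultBodyMass_g,na.rm=TRUE), 
                meanYO = mean(yearlyOffspring,na.rm=TRUE))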

    This code yields the following, which is exactly what we want!

    ## Source: local data frame [16 x 4]
    ## Groups: X1.1_ActivityCycle
    ##
    ##    X1.1_ActivityCycle X6.2_TrophicLevel  meanBM meanYO
    ##  1          nocturnal         carnivore   77.89 11.125
    ##  2        cathermeral         carnivore  227.89  6.250
    ##  3            diurnal         carnivore   88.34    NaN
    ##  4        cathermeral          omnivore  306.65  5.947
    ##  5            diurnal         herbivore 1088.98  9.289
    ##  6        cathermeral                NA   97.97 15.526
    ##  7                 NA         carnivore   51.35 20.250
    ##  8        cathermeral         herbivore 2625.65 14.154
    ##  9          nocturnal         herbivore 1172.53  9.793
    ## 10            diurnal                NA  364.47 21.415
    ## 11          nocturnal          omnivore  542.24  9.741
    ## 12                 NA          omnivore  392.19  5.079
    ## 13          nocturnal                NA  205.08 16.200
    ## 14                 NA         herbivore  452.95  6.264
    ## 15            diurnal          omnivore  300.05  3.438
    ## 16                 NA                NA  220.48 11.508

    The beauty of the %.% operator is that it allows you to do things in the order in which you think about them. You start with your data, then mutate it, then filter it, then group and summarise. You could do the same process with plyr, or with base apply-family functions, but dplyr makes it MUCH cleaner and clearer.  
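As the update at the top of this post notes, current versions of dplyr use %>% in place of %.%. Here is a rough sketch of the same mutate, filter, group, summarise chain written that way. It runs on a tiny made-up data frame rather than PanTHERIA, so the column names and values are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the mammal data
toy <- data.frame(
  taxOrder       = c("Rodentia", "Rodentia", "Rodentia", "Carnivora"),
  activity       = c("nocturnal", "nocturnal", "diurnal", "nocturnal"),
  litterSize     = c(4, 6, 3, 2),
  littersPerYear = c(2, 3, 1, 1),
  mass_g         = c(20, 40, 300, 5000)
)

result <- toy %>%
  # new column: offspring per year
  mutate(yearlyOffspring = litterSize * littersPerYear) %>%
  # keep only the rodents
  filter(taxOrder == "Rodentia") %>%
  # one group per activity cycle
  group_by(activity) %>%
  # per-group means, ignoring missing values
  summarise(meanBM = mean(mass_g, na.rm = TRUE),
            meanYO = mean(yearlyOffspring, na.rm = TRUE))

print(result)
```

The steps still read top to bottom in the order you think about them; only the operator has changed.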

    Now we can visualize these data, and observe that there is a complex relationship between body mass and reproductive output in rodents!

    qplot(data = Activity_Trophic, 
    x = log(meanBM), 
    y = meanYO, 
    size = I(5), 
    shape = X1.1_ActivityCycle) + 

    Please share your experiences with dplyr in the comments section.

    [1] Jones KE, Bielby J, Cardillo M, Fritz SA, O’Dell J, Orme CDL, Safi K, Sechrest W, Boakes EH, Carbone C, et al. 2009. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90:2648.

    Thursday, January 2, 2014

    Discriminant Function Analysis and Phylogenetic Signal

    Rob Scott and I just published a new paper in AJPA on the issues surrounding the use of Discriminant Function Analysis (DFA) in ecomorphology.

    2014 - Barr, WA and Scott, RS. Phylogenetic comparative methods complement discriminant function analysis in ecomorphology.  American Journal of Physical Anthropology. 153:663-674. doi:10.1002/ajpa.22462

    Ecomorphology uses anatomical characteristics to predict the ecological context in which an organism lived. This is possible because organisms adapt anatomically to the functional requirements of their lifestyles. However, ecomorphology may be complicated by the fact that both morphological and ecological traits tend to have phylogenetic signal. In other words, closely related species tend to be more similar than more distantly related species. This can make it difficult to tease apart the effects of functional adaptation from those of phylogenetic signal. 

    One of the most common statistical methods in ecomorphology is DFA.  The purpose of our study was to evaluate the performance of DFA in situations with varying levels of phylogenetic signal.
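For readers who have not run a DFA before, here is a minimal sketch of what one looks like in R. It is not the analysis from the paper: it uses lda() from the MASS package (linear discriminant analysis) on the built-in iris data, with the four flower measurements standing in for morphology and Species standing in for an ecological category.

```r
library(MASS)  # lda() lives in MASS, which ships with standard R installs

# CV = TRUE requests leave-one-out cross-validated class predictions
fit <- lda(Species ~ ., data = iris, CV = TRUE)

# Classification success rate: proportion of specimens assigned correctly
success <- mean(fit$class == iris$Species)
print(success)

# Which categories get confused with which
print(table(predicted = fit$class, actual = iris$Species))
```

The success rate is the kind of number usually reported in ecomorphological studies to show that morphology predicts the category.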

    We used phylogenetic simulations to create datasets that were related to a phylogenetic tree, but were functionally unrelated to a set of ecological characteristics, which had varying levels of phylogenetic signal.  We simulated data in which (1) both the morphological characters and ecological categories had phylogenetic signal, (2) only the morphological characters had phylogenetic signal, (3) only the ecological category had phylogenetic signal, and (4) neither the morphology nor the category had phylogenetic signal.

    Remember: in all cases there was no biomechanical connection between habitat and morphology. We then ran DFA on the resulting datasets. The results are summarized in the figure below. 

    This figure shows the mean success rates of DFA on the vertical axis, and the percentage of DFAs that were significant on the horizontal axis.  When we randomized habitats, DFAs were rarely significant. However, when the actual habitats (with phylogenetic signal) were used, the DFAs were very often statistically significant in cases where the morphological variables had phylogenetic signal.  We used Phylogenetic Generalized Least Squares (PGLS) on these same datasets, and found that PGLS reliably rejects the hypothesis of a biomechanical link between category and morphology.
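The randomization step can be sketched in a few lines. This is an illustration of the general idea only, not the simulation code from the paper: it uses the built-in iris data, has no phylogeny in it at all, and the helper dfa_success() is a name made up for this example. Shuffling the category labels breaks any link between measurements and category, so the success rate of the shuffled runs shows what chance alone produces.

```r
library(MASS)
set.seed(1)  # make the shuffles reproducible

# Hypothetical helper: leave-one-out success rate of a DFA
dfa_success <- function(category, traits) {
  fit <- lda(traits, grouping = category, CV = TRUE)
  mean(fit$class == category)
}

traits <- as.matrix(iris[, 1:4])

# Success rate with the real labels
observed <- dfa_success(iris$Species, traits)

# Success rates after shuffling the labels 200 times
randomized <- replicate(200, dfa_success(sample(iris$Species), traits))

print(observed)
print(mean(randomized))
```

With three equally sized categories, the shuffled runs should hover around one in three. The point of the paper is that real categories with phylogenetic signal can beat this baseline even when there is no functional link, which a plain label shuffle like this one cannot detect.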

    Thus, we concluded that PGLS should be used to validate characters before including them in DFA.