Creating Multivariate Regression Trees (MRT) using R-package

Jenna Jacobs - email@jennajacobs.org

Last updated - January 2012

Multivariate Regression Trees is a technique originally described by De'ath (2002). This is a record of my personal journey in learning how to apply this technique to my data. I am sure there are better ways, and I hope by sharing my experiences others will share theirs to make this technique more useful.

**Overview**

I) Installing R-package and mvpart package from CRAN

II) Preparing Data and importing into R

III) Creating a MRT in R

IV) Hints and tricks

V) Notes and Problems

VI) References

I) Installing R-package and mvpart package from CRAN

1) Download and install the setup file from the r-project website

(at the time of writing the setup file was available here)

2) Open R and install the mvpart package by selecting Install package(s) from CRAN...

3) Select mvpart from the list and click OK
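The same installation can also be done from the R console instead of the menus. This is a minimal sketch, assuming mvpart is still offered by the repository your copy of R points to (it may not be on every mirror). The helper name load_or_install is made up for this illustration.

```r
# Hypothetical helper (not part of R or mvpart): install a package only if
# it is missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured repository
  }
  library(pkg, character.only = TRUE)
}

# Usage (assumes mvpart is available from your repository):
# load_or_install("mvpart")
```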

Now you are ready to do a MRT, but first you need to get your data into R!!!!

II) Preparing Data and importing into R (old way)

1) In Excel create a table with sites/samples/plots as row headings and species then independent variables as column headings.

2) Save the file as a .csv (comma separated values) by choosing "Save as" from the File menu and changing "Save as type" to "CSV (Comma delimited)"

You are now ready to import your data into R

3) Open the R program

4) Change the working directory to where you saved your .csv file from step 2 by selecting "Change dir" from the File menu and browsing to the proper directory

5) To import your file into R you need to use the read.csv command. To find out more about this command type "?read.csv" without the quotation marks at the prompt. To use the read.csv command type:

> name <- read.csv("filename.csv", row.names=1)

name: the name the data will have in R

filename: the name you gave your file in Excel

row.names=1: use this if the first column of your data contains the names of your rows (e.g. "Sample 1")
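Steps 4 and 5 can be tried out entirely from the console. The sketch below writes a tiny made-up table to a temporary folder so that it runs anywhere. The site, species and variable names are invented for illustration.

```r
# Made-up example table: the first column holds the row names, followed by
# two species columns and one environmental variable.
csv_lines <- c(
  "Site,Species A,Species B,Moisture",
  "Sample 1,3,0,0.8",
  "Sample 2,1,5,0.2",
  "Sample 3,0,2,0.5"
)
csv_path <- file.path(tempdir(), "filename.csv")
writeLines(csv_lines, csv_path)

# Step 4: change the working directory (same as "Change dir" in the menu).
old_dir <- setwd(tempdir())

# Step 5: import the file; row.names = 1 uses the first column as row names.
name <- read.csv("filename.csv", row.names = 1)

setwd(old_dir)  # go back to the original working directory

rownames(name)  # "Sample 1" "Sample 2" "Sample 3"
names(name)     # "Species.A" "Species.B" "Moisture"
```

Note that read.csv replaces the spaces in column headings with dots, so a heading such as "Species A" becomes Species.A inside R.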

III) Creating a MRT in R

1) Load the mvpart package by selecting "Load Package..." from the Packages menu. Then select "mvpart" and click OK.

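2) Load your data by typing "data(name)" at the command prompt (e.g. "data(spider)" for the spider example data used below). R is case sensitive, so the command is data, not Data. Data imported with read.csv is already available under the name you gave it and does not need this step.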

3) To create a MRT using the Euclidean distance measure, at the command line type:

> mvpart(data.matrix(name[,1:12])~Variable 1+Variable 2+Variable 3+Variable 4+Variable 5+...,name)

(1:12 are the columns containing the species; Variable 1, 2, 3... are the names of the columns of the independent variables)
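Variable 1, Variable 2 and so on are placeholders for your own column names. Because read.csv turns headings with spaces into names such as Variable.1, it can help to build the formula from the column names directly. This is a sketch with made-up data (the data frame and its column names are assumptions for illustration). The mvpart call is left as a comment because it needs the package and real data.

```r
# Made-up data: 3 species columns followed by 2 environmental variables,
# named the way read.csv would name "Variable 1" and "Variable 2".
name <- data.frame(
  sp1 = c(3, 1, 0, 4), sp2 = c(0, 5, 2, 1), sp3 = c(2, 2, 6, 0),
  Variable.1 = c(0.8, 0.2, 0.5, 0.9), Variable.2 = c(10, 14, 12, 9)
)

species_cols <- 1:3                      # columns holding the species
env_vars <- names(name)[-species_cols]   # every other column is a predictor

# Paste together the same formula you would otherwise type by hand.
mrt_formula <- as.formula(
  paste("data.matrix(name[, 1:3]) ~", paste(env_vars, collapse = " + "))
)
print(mrt_formula)

# With mvpart loaded, the tree would then be requested as:
# mvpart(mrt_formula, name)
```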

4) To create a MRT using the Bray-Curtis distance measure, at the command line type:

> mvpart(gdist(name[,1:12],meth="bray",full=TRUE,sq=TRUE)~Variable 1+Variable 2+Variable 3+Variable 4+Variable 5+...,name,method="mrt")
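For readers unfamiliar with the Bray-Curtis measure, it compares two samples by dividing the summed absolute differences in abundance by the summed abundances. The base-R sketch below is only an illustration with made-up counts, not the code gdist itself uses. The function name bray_curtis is invented here, and the sketch assumes that sq=TRUE squares the dissimilarities and full=TRUE returns the full square matrix.

```r
# Bray-Curtis dissimilarity between two abundance vectors:
# sum(|x - y|) / sum(x + y). Hypothetical helper for illustration only.
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Made-up abundances: rows are samples, columns are species.
abund <- rbind(
  s1 = c(1, 2, 3),
  s2 = c(3, 2, 1),
  s3 = c(0, 0, 6)
)

# Full square matrix of squared dissimilarities between all sample pairs.
n <- nrow(abund)
d2 <- matrix(0, n, n, dimnames = list(rownames(abund), rownames(abund)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d2[i, j] <- bray_curtis(abund[i, ], abund[j, ])^2
  }
}
round(d2, 3)
```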

IV) Hints and tricks

Variance explained by each node

To find the variance explained by each node, save the tree to an object and then print the summary of that object.

1) To save the tree to an object, precede the command with name<- , just as when first reading in the csv.

> mrtspider <- mvpart(gdist(spider[,1:12],meth="bray",full=TRUE,sq=TRUE)~herbs+reft+moss+sand+twigs+water,spider,method="mrt",xv="1se", which="4")

2) Then print the summary of this object:

> summary(mrtspider)

3) The variance explained can be calculated from this table:

| | CP | nsplit | rel error | xerror | xstd |
|---|---|---|---|---|---|
| 1 | 0.556393 | 0 | 1 | 1.071913 | 0.125958 |
| 2 | 0.200243 | 1 | 0.443608 | 0.763165 | 0.154415 |
| 3 | 0.086731 | 2 | 0.243365 | 0.609107 | 0.175033 |

When "nsplit" is 0 the relative error is 1, so the variance explained (1-rel error) is 0.

When nsplit is 1 the relative error is 0.44, so the variance explained by the first split is 0.56.

When nsplit is 2 the relative error is 0.24, so the variance explained by both splits is 0.76.

By simple subtraction the variance explained by just the second node is 0.20.
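The same arithmetic can be done in R. The sketch below retypes the table above by hand so that it runs without mvpart. With a fitted tree, the same table is probably available as mrtspider$cptable, as it is for rpart objects, but that is an assumption worth checking on your own object.

```r
# Values copied from the summary table above.
cptable <- cbind(
  CP          = c(0.556393, 0.200243, 0.086731),
  nsplit      = c(0, 1, 2),
  "rel error" = c(1, 0.443608, 0.243365),
  xerror      = c(1.071913, 0.763165, 0.609107),
  xstd        = c(0.125958, 0.154415, 0.175033)
)

# Cumulative variance explained after each number of splits.
explained <- 1 - cptable[, "rel error"]
round(explained, 2)   # 0.00 0.56 0.76

# Variance explained by each split on its own.
per_split <- diff(explained)
round(per_split, 2)   # 0.56 0.20
```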

V) Notes and Problems

- To make this method reliable I feel that a large number of trees should be run. For my personal application of this method I set xv to "1se", so that R picks the best tree within one SE of the overall best, get R to create a large number of trees (>50), and then pick the tree that is most consistently produced. There is probably a way to make R run 100 trees and give a summary of the results.

Solution: add xvmult as an argument to the mvpart function

mvpart(gdist(spider[,1:12],meth="bray",full=TRUE,sq=TRUE) ~herbs+reft+moss+sand+twigs+water,spider,method="mrt",xv="1se", which="4", xvmult=50)
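Another way to find the tree that is most consistently produced is to record the size of the tree from each run and tabulate the sizes. This is a sketch under assumptions: the sizes vector below is invented purely to show the tallying, and the commented lines show how the sizes might be collected, assuming the fitted object keeps its nodes in a frame component the way rpart objects do.

```r
# How the sizes might be collected (not run here; needs mvpart and data):
# sizes <- replicate(50, {
#   fit <- mvpart(gdist(spider[,1:12], meth="bray", full=TRUE, sq=TRUE) ~
#                 herbs+reft+moss+sand+twigs+water, spider,
#                 method="mrt", xv="1se")
#   sum(fit$frame$var == "<leaf>")   # number of leaves in this tree
# })

# Invented tree sizes standing in for the result of such runs.
sizes <- c(3, 4, 3, 3, 2, 4, 3, 3, 4, 3)

# How often each tree size occurred.
size_counts <- table(sizes)
size_counts

# Tree size that was produced most often.
most_common <- as.integer(names(size_counts)[which.max(size_counts)])
most_common   # 3
```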

- To make this method more useful you need to be able to create a table with the information seen in Table 1 of De'ath's paper. I cannot figure out how to reproduce this table and would really like to. I will keep trying and will update this page when I can.

Solution: a new package has just arrived that will create this table.

- The graphs produced by the MRT using Euclidean distance show species along the x-axis and abundance on the y-axis. The Bray-Curtis distance measure produces a different graph, with species along the x-axis and, I believe, the sum of squares on the y-axis.

- I would like a better understanding of what exactly the Error, CV Error and SE numbers mean. In his paper De'ath states: "accuracy is better estimated from the cross-validated relative error (CVRE). CVRE varies from zero for perfect predictor to close to one for a poor predictor.", which I assume is the CV Error term produced with the tree. One minus the Error term is the amount of variation described by the tree. Therefore, you want a low Error and a low CV Error.
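To make the CV Error and SE numbers a little more concrete, here is how a "within one SE of the best" choice works out for the three rows of the summary table shown earlier on this page. It is a sketch that assumes xerror is the CV Error and xstd is its SE, as in rpart output. The exact rule mvpart applies for xv="1se" may differ in detail.

```r
# Cross-validation columns copied from the summary table earlier on this page.
nsplit <- c(0, 1, 2)
xerror <- c(1.071913, 0.763165, 0.609107)  # assumed to be the CV Error
xstd   <- c(0.125958, 0.154415, 0.175033)  # assumed to be the SE of the CV Error

best <- which.min(xerror)                  # tree with the lowest CV Error
threshold <- xerror[best] + xstd[best]     # lowest CV Error plus one SE

# Smallest tree whose CV Error is within one SE of the best.
chosen <- min(which(xerror <= threshold))
nsplit[chosen]   # 1
```

Here the lowest CV Error belongs to the tree with two splits, but the tree with one split falls inside one SE of it, so the smaller tree is the one selected.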

VI) References

De'ath, G. 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83:1105-1117.