The Loblolly data in R has several variables pertaining to growth records for Loblolly pines, a type of pine tree native to the Southeastern United States. Load this data in R and examine the help file with the following commands:
data("Loblolly")
?Loblolly
Boxplots
In order to get more comfortable examining model assumptions, we’d like to get familiar with R’s plotting capabilities. We will start by examining how height varies across different levels of seed. Since the seed variable has 14 levels, we will ask R for a subset of the data that includes only seeds 329, 315, and 305.
The following command creates this subset by taking Loblolly such that Seed is 329 or Seed is 315 or Seed is 305. The droplevels command cleans up the subsetted data so that it will plot nicely.
subset <- Loblolly[Loblolly$Seed == 329 | Loblolly$Seed == 315 | Loblolly$Seed == 305,]
subset <- droplevels(subset)
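The same subset can be written more compactly with the %in% operator. This is only a sketch of an alternative, not the lab's own code; the names keep_seeds and pine_subset are made up for illustration (pine_subset is used so that R's built-in subset() function is not masked).

```r
# Load the built-in dataset
data("Loblolly")

# Seed sources to keep (hypothetical helper vector)
keep_seeds <- c("329", "315", "305")

# Rows "such that" Seed is one of the three values above
pine_subset <- Loblolly[Loblolly$Seed %in% keep_seeds, ]

# Drop the unused factor levels so plots show only three groups
pine_subset <- droplevels(pine_subset)

nrow(pine_subset)         # number of rows kept
levels(pine_subset$Seed)  # remaining seed levels
```

With three seeds and six measurements per seed, the subset should contain 18 rows.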
To examine graphically how height varies across different levels of seed (in our subset of the data), we will start with a boxplot. Remember that we can use a ~ as “by”. That is, we want a boxplot of height by levels of seed. Recall also that we use the dollar sign $ to tell R that we want a particular variable from a dataset, i.e., dataset$variable.
boxplot(subset$height ~ subset$Seed)
This plot is a nice start, but it may look somewhat incomplete. It’s missing a title and could stand to have cleaner axis labels. The following command adds a main title and axis labels xlab and ylab:
boxplot(subset$height ~ subset$Seed, main = "Boxplot of Tree Height by Seed on Subsetted Data", xlab = "Seed", ylab="Height (ft)")
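As an aside, the formula interface also accepts a data argument, which avoids repeating dataset$ in front of each variable. A minimal sketch, assuming the three-seed subset is rebuilt under the hypothetical name three_seeds:

```r
data("Loblolly")

# Rebuild the three-seed subset (three_seeds is an illustrative name)
three_seeds <- droplevels(
  Loblolly[Loblolly$Seed %in% c("329", "315", "305"), ]
)

# height ~ Seed reads as "height by Seed"; data = tells R where to look
bp <- boxplot(height ~ Seed, data = three_seeds,
              main = "Boxplot of Tree Height by Seed on Subsetted Data",
              xlab = "Seed", ylab = "Height (ft)")

# boxplot() invisibly returns the summary statistics it drew
bp$names   # group labels
bp$stats   # one column of five-number summaries per group
```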
Histograms
While boxplots are a convenient way to make side-by-side comparisons, it can be difficult to conclusively answer questions like the one posed in Exercise 3. Histograms provide a much more straightforward way to examine the shape of a distribution. The following command creates a basic histogram for the Loblolly height variable:
hist(Loblolly$height)
We would again like to include a better title and new axis labels. Fortunately, R’s functions for plotting all use the same approach!
hist(Loblolly$height, main = "Histogram of Loblolly Pine Heights", xlab = "Height (ft)", ylab="Frequency")
Previously, we wanted information on normality for a subset of the data, using only seeds 329, 315, and 305. We can build individual histograms for these data. Recall that the square brackets can be read as “such that”.
hist(Loblolly$height[Loblolly$Seed == 329], main = "Histogram of Pine Heights for Seed 329", xlab = "Height (ft)", ylab="Frequency")
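To see all three seeds at once, the same hist call can be repeated in a loop. This is a sketch rather than part of the lab: the loop, the par(mfrow = ...) layout, and the names seeds, old_par, and counts are our own additions.

```r
data("Loblolly")

seeds <- c("329", "315", "305")

# Put three plots side by side, remembering the old layout
old_par <- par(mfrow = c(1, 3))

for (s in seeds) {
  # Heights "such that" Seed equals the current seed
  hist(Loblolly$height[Loblolly$Seed == s],
       main = paste("Pine Heights for Seed", s),
       xlab = "Height (ft)", ylab = "Frequency")
}

# Restore the previous plotting layout
par(old_par)

# Number of observations behind each histogram
counts <- sapply(seeds, function(s) sum(Loblolly$Seed == s))
counts
```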
Scatterplots and Regression Lines
We may run into some problems with this data because there are only 6 observations per seed! It may be more reasonable to compare age and height of trees in the complete data.
If we want to examine the relationship between age and height, it is reasonable to think that we would be interested in using height to predict age. (If we are walking through a forest of Loblolly pine trees, it will be much easier to get a tree’s height than its age!)
Using this predictor/response variable setting, we want to look at a scatterplot of the data to get an idea of whether there might be any correlation between the two. The following command creates this scatterplot, complete with a title and reasonable axis labels.
plot(x = Loblolly$height, y = Loblolly$age, main = "Scatterplot of Age vs Height", xlab = "Height (feet)", ylab = "Age (Years)")
We’ll spend some time on regression next week, but for now the regression line is
ŷ = 0.7574 + 0.3783x
We can include this in our plot using the function abline. This function adds a line to an existing plot in R. The name “abline” refers to the way that lines are written in many an algebra class: y = a + bx. Include the regression line in your scatterplot by adding the following line of code right under the previous plot function:
abline(a = 0.7574, b = 0.3783, col='red')
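For readers curious where the intercept and slope come from, here is a hedged sketch. It assumes the coefficients are the least-squares fit of age on height, which the lab does not state outright. lm and coef are standard R functions; the name fit is ours.

```r
data("Loblolly")

# Least-squares fit with age as the response and height as the predictor
fit <- lm(age ~ height, data = Loblolly)

# Intercept (a) and slope (b) of the fitted line
coef(fit)

# Scatterplot with the fitted line; abline() also accepts an lm object
plot(x = Loblolly$height, y = Loblolly$age,
     main = "Scatterplot of Age vs Height",
     xlab = "Height (feet)", ylab = "Age (Years)")
abline(fit, col = "red")
```

If the assumption holds, coef(fit) should agree with the values 0.7574 and 0.3783 up to rounding.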
Please show 6-9 specifically.
1.
data("Loblolly")
?Loblolly
There are 3 variables in the Loblolly dataset:
1. height = numerical (tree height in feet)
2. age = numerical (tree age in years)
3. Seed = categorical (an ordered factor giving the seed source)
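These variable types can be checked directly in R rather than read off the help file. A short sketch using only base functions:

```r
data("Loblolly")

# Compact overview: one line per variable with its type
str(Loblolly)

# Explicit checks of each variable's type
is.numeric(Loblolly$height)
is.numeric(Loblolly$age)
is.factor(Loblolly$Seed)

# Size of the dataset and number of seed sources
dim(Loblolly)
nlevels(Loblolly$Seed)
```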
2.
subset <- Loblolly[Loblolly$Seed == 329 | Loblolly$Seed == 315 | Loblolly$Seed == 305,]
subset <- droplevels(subset)
boxplot(subset$height ~ subset$Seed)
boxplot(subset$height ~ subset$Seed,
main = "Boxplot of Tree Height by Seed on Subsetted Data",
xlab = "Seed", ylab="Height (ft)")
3.
hist(Loblolly$height)
hist(Loblolly$height,
main = "Histogram of Loblolly Pine Heights",
xlab = "Height (ft)", ylab="Frequency")
hist(Loblolly$height[Loblolly$Seed == 329],
main = "Histogram of Pine Heights for Seed 329",
xlab = "Height (ft)", ylab="Frequency")
4.
unique(Loblolly$height)
[1] 4.51 10.89 28.72 41.74 52.70 60.92 4.55 10.92 29.07 42.83 53.88 63.39
[13] 4.79 11.37 30.21 44.40 55.82 64.10 3.91 9.48 25.66 39.07 50.78 59.07
[25] 4.81 11.20 28.66 41.66 53.31 63.05 3.88 9.40 25.99 39.55 51.46 59.64
[37] 4.32 10.43 27.16 40.85 51.33 60.07 4.57 10.57 27.90 41.13 52.43 60.69
[49] 3.77 9.03 25.45 38.98 49.76 60.28 4.33 10.79 28.97 42.44 53.17 61.62
[61] 4.38 10.48 27.93 40.20 50.06 58.49 4.12 9.92 26.54 37.82 48.43 56.81
[73] 3.93 9.34 26.08 37.79 48.31 56.43 3.46 9.05 25.85 39.15 49.12 59.49
unique(Loblolly$age)
[1] 3 5 10 15 20 25
There are 84 different values of height in total and 6 different values of age.
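Rather than counting the printed values by hand, the counts can be computed directly. A small sketch; the names n_heights and n_ages are just illustrative:

```r
data("Loblolly")

# Count the distinct values of each variable
n_heights <- length(unique(Loblolly$height))
n_ages <- length(unique(Loblolly$age))

n_heights
n_ages

# The distinct ages, in increasing order
sort(unique(Loblolly$age))
```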
Scatter plot and regression:
plot(x = Loblolly$height, y = Loblolly$age, main = "Scatterplot of Age vs Height", xlab = "Height (feet)", ylab = "Age (Years)")
abline(a = 0.7574, b = 0.3783, col='red')
The explanatory variable is height, and age is the response variable.
From the scatterplot we see that there is a linear relationship between tree height and tree age.
The relationship is positive: taller trees tend to be older.
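The visual impression can be backed up with the sample correlation coefficient. A minimal sketch using base R's cor function; the name r is ours:

```r
data("Loblolly")

# Pearson correlation between height and age
r <- cor(Loblolly$height, Loblolly$age)
r
```

A value close to +1 would be consistent with the strong positive linear pattern seen in the plot.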
The given regression equation is
ŷ = 0.7574 + 0.3783x
The predicted age of a tree that is 20 feet tall is
ŷ = 0.7574 + 0.3783(20) = 8.3234
Hence the predicted age of a tree that is 20 feet tall is about 8.3 years.
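The same arithmetic can be wrapped in a small helper so other heights are easy to try. This is only a sketch: predict_age is a made-up function name, and the default coefficients are the ones given in the question.

```r
# Predicted age (years) from height (feet) using the given line y = a + b*x
# predict_age is a hypothetical helper, not part of the lab
predict_age <- function(height_ft, a = 0.7574, b = 0.3783) {
  a + b * height_ft
}

predict_age(20)           # 0.7574 + 0.3783 * 20 = 8.3234
predict_age(c(10, 20, 40))  # vectorised over several heights
```

Predictions are most trustworthy for heights inside the observed range of the data (roughly 3.5 to 64 feet).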