Question

The Loblolly data in R has several variables pertaining to growth records for Loblolly pines, a type of pine tree native to the Southeastern United States. Load this data in R and examine the help file with the following commands:

data("Loblolly")
?Loblolly
  1. What are the variables in this dataset? Are they numeric or categorical? (6 pts)

Boxplots

In order to get more comfortable examining model assumptions, we’d like to get familiar with R’s plotting capabilities. We will start by examining how height varies across different levels of seed. Since the seed variable has 14 levels, we will ask R for a subset of the data that includes only seeds 329, 315, and 305.

The following command creates this subset by taking Loblolly such that Seed is 329 or Seed is 315 or Seed is 305. The droplevels command cleans up the subsetted data so that it will plot nicely.

subset <- Loblolly[Loblolly$Seed == 329 | Loblolly$Seed == 315 | Loblolly$Seed == 305,]
subset <- droplevels(subset)
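
The same subset can also be written more compactly with the %in% operator. The following is a minimal sketch, not part of the original exercise; the names `keep` and `sub` are hypothetical choices, with `sub` used here to avoid reusing the name of R's built-in subset() function.

```r
data("Loblolly")

# Keep only the rows whose Seed is one of the three seeds of interest.
keep <- c("329", "315", "305")
sub <- Loblolly[Loblolly$Seed %in% keep, ]

# Drop the unused factor levels so that plots show only three groups.
sub <- droplevels(sub)

nlevels(sub$Seed)  # number of seed levels remaining
nrow(sub)          # number of rows remaining
```

With 6 measurements per seed, the subset should contain 18 rows across 3 seed levels.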

To examine graphically how height varies across different levels of seed (in our subset of the data), we will start with a boxplot. Remember that we can use a ~ as “by”. That is, we want a boxplot of height by levels of seed. Recall also that we use the dollar sign $ to tell R that we want a particular variable from a dataset, i.e., dataset$variable.

boxplot(subset$height ~ subset$Seed)

This plot is a nice start, but it may look somewhat incomplete. It’s missing a title and could stand to have cleaner axis labels. The following command adds a main title and axis labels xlab and ylab:

boxplot(subset$height ~ subset$Seed, 
        main = "Boxplot of Tree Height by Seed on Subsetted Data",
        xlab = "Seed", ylab="Height (ft)")
  2. Do you think that height differs between different values of seed? (2pts)
  3. Do the three height groups look normally distributed? (2pts)
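
A boxplot draws a five-number summary for each group, so it can help to print those numbers alongside the plot. The sketch below is one possible way to do that with tapply; the names `sub` and `by_seed` are hypothetical and not part of the original exercise.

```r
data("Loblolly")
sub <- droplevels(Loblolly[Loblolly$Seed %in% c("329", "315", "305"), ])

# Minimum, quartiles, median, mean, and maximum of height within each seed.
by_seed <- tapply(sub$height, sub$Seed, summary)
by_seed

# Number of measurements per seed; small groups make shape hard to judge.
table(sub$Seed)
```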

Histograms

While boxplots are a convenient way to make side-by-side comparisons, it can be difficult to conclusively answer questions like the one posed in Exercise 3. Histograms provide a much more straightforward way to examine the shape of a distribution. The following command creates a basic histogram for the Loblolly height variable:

hist(Loblolly$height)

We would again like to include a better title and new axis labels. Fortunately, R’s functions for plotting all use the same approach!

hist(Loblolly$height,
     main = "Histogram of Loblolly Pine Heights",
     xlab = "Height (ft)", ylab="Frequency")
  4. Do the Loblolly pine heights appear to be normally distributed? (2pts)

Previously, we wanted information on normality for a subset of the data, using only seeds 329, 315, and 305. We can build individual histograms for these data. Recall that the square brackets can be read as “such that”.

hist(Loblolly$height[Loblolly$Seed == 329],
     main = "Histogram of Pine Heights for Seed 329",
     xlab = "Height (ft)", ylab="Frequency")
  5. Create histograms of the pine heights for the other two seeds, 315 and 305. For each seed, decide whether it is reasonable to assume that the heights are normally distributed. (4pts)
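
One possible way to produce the per-seed histograms without repeating the hist call is a short loop. This is a sketch under the assumption that a side-by-side layout is wanted; the names `seeds`, `n_per_seed`, and `old_par` are hypothetical.

```r
data("Loblolly")

seeds <- c("329", "315", "305")

# Count the measurements behind each histogram before plotting.
n_per_seed <- sapply(seeds, function(s) sum(Loblolly$Seed == s))
n_per_seed

# Draw the three histograms side by side for easier comparison.
old_par <- par(mfrow = c(1, 3))
for (s in seeds) {
  hist(Loblolly$height[Loblolly$Seed == s],
       main = paste("Histogram of Pine Heights for Seed", s),
       xlab = "Height (ft)", ylab = "Frequency")
}
par(old_par)  # restore the previous plotting layout
```

Each histogram is built from only a handful of values, so any judgment about normality from these plots is tentative.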

Scatterplots and Regression Lines

We may run into some problems with this data because there are only 6 observations per seed! It may be more reasonable to compare age and height of trees in the complete data.

  6. How many observations are there for height? (2pts) How many different values are there for age? (2pts) Show your R code to find the answers to the two questions above. (4pts)

If we want to examine the relationship between age and height, it is reasonable to think that we would be interested in using height to predict age. (If we were walking through a forest of Loblolly pine trees, it would be much easier to get a tree’s height than its age!)

  7. In this setting, which is the explanatory variable (predictor)? Which is the response? (2pts)

Using this predictor/response variable setting, we want to look at a scatterplot of the data to get an idea of whether there might be any correlation between the two. The following command creates this scatterplot, complete with a title and reasonable axis labels.

plot(x = Loblolly$height, y = Loblolly$age,
     main = "Scatterplot of Age vs Height",
     xlab = "Height (feet)", ylab = "Age (Years)")
  8. Is there evidence of a linear relationship between tree age and tree height? (2pts) Without doing any math or using the computer, take a guess as to what the correlation might be for these two variables. (2pts)

We’ll spend some time on regression next week, but for now the regression line is

ŷ = 0.7574 + 0.3783x

We can include this in our plot using the function abline. This function adds a line to an existing plot in R. The name “abline” refers to the way that lines are written in many an algebra class: y = a + bx. Include the regression line in your scatterplot by adding the following line of code right under the previous plot function:

abline(a = 0.7574, b = 0.3783, col='red')

  9. Predict the age of a tree that is 20 feet tall. (3pts)

Please show 6-9 specifically.

Solutions

Expert Solution

1.

data("Loblolly")
?Loblolly

There are 3 variables in the dataset Loblolly.

1. height = numeric

2. age = numeric

3. Seed = categorical (an ordered factor with 14 levels)
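
The variable types can be checked directly in R rather than read off the help file. A minimal sketch using only base R functions:

```r
data("Loblolly")

# Class of each column: numeric for height and age,
# an ordered factor for Seed.
sapply(Loblolly, class)

is.numeric(Loblolly$height)  # TRUE if height is numeric
is.numeric(Loblolly$age)     # TRUE if age is numeric
is.factor(Loblolly$Seed)     # TRUE if Seed is categorical
nlevels(Loblolly$Seed)       # number of seed levels
```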

2.

subset <- Loblolly[Loblolly$Seed == 329 | Loblolly$Seed == 315 | Loblolly$Seed == 305,]
subset <- droplevels(subset)
boxplot(subset$height ~ subset$Seed)

boxplot(subset$height ~ subset$Seed,
        main = "Boxplot of Tree Height by Seed on Subsetted Data",
        xlab = "Seed", ylab="Height (ft)")

3.

hist(Loblolly$height)

hist(Loblolly$height,
     main = "Histogram of Loblolly Pine Heights",
     xlab = "Height (ft)", ylab="Frequency")

hist(Loblolly$height[Loblolly$Seed == 329],
     main = "Histogram of Pine Heights for Seed 329",
     xlab = "Height (ft)", ylab="Frequency")

4.

unique(Loblolly$height)

[1] 4.51 10.89 28.72 41.74 52.70 60.92 4.55 10.92 29.07 42.83 53.88 63.39
[13] 4.79 11.37 30.21 44.40 55.82 64.10 3.91 9.48 25.66 39.07 50.78 59.07
[25] 4.81 11.20 28.66 41.66 53.31 63.05 3.88 9.40 25.99 39.55 51.46 59.64
[37] 4.32 10.43 27.16 40.85 51.33 60.07 4.57 10.57 27.90 41.13 52.43 60.69
[49] 3.77 9.03 25.45 38.98 49.76 60.28 4.33 10.79 28.97 42.44 53.17 61.62
[61] 4.38 10.48 27.93 40.20 50.06 58.49 4.12 9.92 26.54 37.82 48.43 56.81
[73] 3.93 9.34 26.08 37.79 48.31 56.43 3.46 9.05 25.85 39.15 49.12 59.49

unique(Loblolly$age)

[1] 3 5 10 15 20 25

There are 84 observations of height in total (all 84 values happen to be distinct) and 6 distinct values of age.
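
Counting by eye from the printed output is error-prone, so the counts can be computed directly. A short sketch; the names `n_height` and `n_ages` are hypothetical:

```r
data("Loblolly")

n_height <- length(Loblolly$height)      # number of height observations
n_ages <- length(unique(Loblolly$age))   # number of distinct ages

n_height
n_ages

# Number of observations recorded at each age.
table(Loblolly$age)
```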

Scatter plot and regression:

plot(x = Loblolly$height, y = Loblolly$age,
     main = "Scatterplot of Age vs Height",
     xlab = "Height (feet)", ylab = "Age (Years)")
abline(a = 0.7574, b = 0.3783, col='red')

The explanatory variable is height and the response variable is age.

From the scatterplot, there is evidence of a linear relationship between tree age and tree height.

The relationship is positive and the points lie close to a straight line, so a guess for the correlation close to 1 is reasonable.
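
After making a guess by eye, the sample correlation can be checked with cor. A minimal sketch; the name `r` is a hypothetical choice:

```r
data("Loblolly")

# Pearson correlation between height and age.
r <- cor(Loblolly$height, Loblolly$age)
r
```

A value close to 1 would agree with the strong positive linear pattern seen in the scatterplot.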

The given regression equation is

ŷ = 0.7574 + 0.3783x

The predicted age of a tree that is 20 feet tall is

ŷ = 0.7574 + 0.3783(20) = 8.3234

Hence the predicted age of a tree that is 20 feet tall is about 8.32 years.
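
The same arithmetic can be done in R. The sketch below first uses the coefficients quoted in the exercise, then refits the line with lm under the assumption that the quoted coefficients came from a least-squares fit of age on height; the names `a`, `b`, `pred_quoted`, and `fit` are hypothetical.

```r
data("Loblolly")

# Prediction from the coefficients quoted in the exercise.
a <- 0.7574
b <- 0.3783
pred_quoted <- a + b * 20
pred_quoted  # 8.3234

# Assumption: a least-squares fit of age on height should give
# coefficients close to the quoted ones.
fit <- lm(age ~ height, data = Loblolly)
coef(fit)
predict(fit, newdata = data.frame(height = 20))
```

If the refitted coefficients differ noticeably from the quoted values, the quoted line was likely obtained some other way, and the hand calculation above remains the answer to the exercise as stated.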

