(1) Read in the data and create an R data frame named tennis.dfr that has the following names for its columns: first.name, last.name, major.match.wins, major.match.losses, overall.match.wins, overall.match.losses, major.titles, overall.titles. (Note that the data file has several explanatory lines before the real data begin that should be skipped when reading in the data lines.) NOTE: For the file name, you must use the following web address (URL): "http://people.stat.sc.edu/hitchcock/tennisplayers2018.txt". Please do not have your code read in the file from your own personal directory.
(2) Create and add two more columns called major.winning.pct and overall.winning.pct (showing winning percentage in the "major" and "overall" categories, respectively) to this data frame. Note that "winning percentage" is defined as (match wins)/(match wins + match losses).
(3) Sort the data frame by major titles, from most to least. Have your program print the sorted data frame.
(4) Perform a nested sort, sorting the data frame first by major titles (from most to least), and then by major winning percentage (from most to least) within major-title levels. Have your program print this sorted data frame.
(5) Have R extract the subset of the data frame consisting of players with at least 6 major titles. Call this new data frame: greatest.dfr. Have your program print this new data frame.
(6) In the most efficient way possible, have R calculate the sample means for each of the numeric variables in the tennis.dfr data set. (Hint: Extract the appropriate subset of the data frame first.)
(7) Use the write.table() function to write the data set tennis.dfr to an external file simply called "tennisdata.txt". Make sure the external file includes the column names. Also, make sure the players' names are NOT surrounded by quotes in the external file.
(1) R-Code:
# Skip the explanatory lines at the top of the file, then read the data rows
data <- read.table("http://people.stat.sc.edu/hitchcock/tennisplayers2018.txt",
                   header = FALSE, fill = TRUE, skip = 7)
# Assign the required column names
colnames(data) <- c("first.name", "last.name",
                    "major.match.wins", "major.match.losses",
                    "overall.match.wins", "overall.match.losses",
                    "major.titles", "overall.titles")
(2)
# Winning percentage = wins / (wins + losses)
data$major.winning.pct <- data$major.match.wins /
  (data$major.match.wins + data$major.match.losses)
data$overall.winning.pct <- data$overall.match.wins /
  (data$overall.match.wins + data$overall.match.losses)
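The winning-percentage formula can be checked on a small made-up example. This is only a sketch: the data frame toy.dfr, its values, and the helper win.pct are hypothetical and are not taken from the tennis file.

```r
# Hypothetical win/loss counts, not taken from the tennis file
toy.dfr <- data.frame(wins = c(30, 45), losses = c(10, 45))

# Hypothetical helper: winning percentage = wins / (wins + losses)
win.pct <- function(wins, losses) wins / (wins + losses)

# The arithmetic is vectorized, so one call fills the whole column
toy.dfr$winning.pct <- win.pct(toy.dfr$wins, toy.dfr$losses)
print(toy.dfr)
```

Because R arithmetic works elementwise on columns, no loop over players is needed.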
(3)
# Sort by major titles, most to least, and print the result
data <- data[order(-data$major.titles), ]
print(data)
(4)
# Nested sort: major titles first, then major winning percentage within ties
data <- data[order(-data$major.titles, -data$major.winning.pct), ]
print(data)
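To see how the second key in order() only breaks ties in the first, here is a minimal sketch on made-up data. The data frame toy.dfr and its values are hypothetical.

```r
# Hypothetical players with tied title counts
toy.dfr <- data.frame(player = c("A", "B", "C", "D"),
                      titles = c(3, 8, 3, 8),
                      pct    = c(0.70, 0.80, 0.75, 0.85),
                      stringsAsFactors = FALSE)

# Negating a numeric key sorts it from largest to smallest;
# pct is consulted only among rows with equal titles
sorted.dfr <- toy.dfr[order(-toy.dfr$titles, -toy.dfr$pct), ]
print(sorted.dfr)
```

The two players with 8 titles come first, ordered by pct, followed by the two players with 3 titles, again ordered by pct.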
(5)
library(dplyr)
# Keep players with at least 6 major titles and print the subset
greatest.dfr <- data %>% filter(major.titles >= 6)
print(greatest.dfr)
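If dplyr is not installed, the same kind of row filter can be written in base R with logical indexing. The sketch below uses a made-up data frame, toy.dfr, with hypothetical values.

```r
# Hypothetical title counts, not taken from the tennis file
toy.dfr <- data.frame(player = c("A", "B", "C"),
                      major.titles = c(2, 6, 11),
                      stringsAsFactors = FALSE)

# A logical vector in the row position keeps only rows where it is TRUE
toy.greatest <- toy.dfr[toy.dfr$major.titles >= 6, ]
print(toy.greatest)
```

Note that >= includes players with exactly 6 titles, matching "at least 6".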
(6)
# colMeans() is vectorized over the extracted numeric columns
colMeans(data[, 3:8])
    major.match.wins   major.match.losses   overall.match.wins overall.match.losses
          159.966667            41.700000           700.900000           225.833333
        major.titles       overall.titles
            6.366667            50.700000
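Instead of hard-coding column positions, the numeric columns can be selected by type before averaging. This is a sketch on a made-up data frame; toy.dfr and its values are hypothetical.

```r
# Hypothetical data with one character column and two numeric columns
toy.dfr <- data.frame(first.name = c("Ann", "Bob"),
                      wins = c(10, 20),
                      losses = c(4, 6),
                      stringsAsFactors = FALSE)

# Flag the numeric columns, then average them all in one call;
# drop = FALSE keeps a data frame even if only one column is numeric
numeric.cols <- sapply(toy.dfr, is.numeric)
toy.means <- colMeans(toy.dfr[, numeric.cols, drop = FALSE])
print(toy.means)
```

Selecting by type also picks up any numeric columns added later, such as the winning-percentage columns.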
(7)
# quote = FALSE keeps the names unquoted; col.names = TRUE writes the header row
write.table(data, "tennisdata.txt", col.names = TRUE, row.names = FALSE,
            quote = FALSE, sep = " ")
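One way to confirm that the file has a header row and no quotes is to write a small data frame and read the raw lines back. This sketch uses a made-up data frame, toy.dfr, and a temporary file rather than "tennisdata.txt".

```r
# Hypothetical data, written to a temporary file for inspection
toy.dfr <- data.frame(first.name = c("Ann", "Bob"),
                      wins = c(10, 20),
                      stringsAsFactors = FALSE)
out.file <- tempfile(fileext = ".txt")

write.table(toy.dfr, out.file, col.names = TRUE, row.names = FALSE,
            quote = FALSE)

# readLines() shows the file exactly as written
written <- readLines(out.file)
print(written)
```

The first line holds the column names, and the names in the data rows appear without surrounding quotes.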