Question

In: Computer Science

# Write a function called `get_state_data` that allows you to specify a state, # then saves...

# Write a function called `get_state_data` that allows you to specify a state,
# then saves a .csv file (`STATE_data.csv`) with observations from that state
# This includes data about the state, as well as the counties in the state
# You should use the full any.drinking dataset in this function (not just 2012)

# Demonstrate that you function works by passing "Utah" to the function
state_Utah <- get_state_data(Utah)

############################ Binge drinking Dataset ############################

# In this section, you will ask a variety of questions regarding the
# `binge_drinking.csv` dataset. More specifically, you will analyze a subset of
# the observations of *just the counties* (exclude state/national estimates!).
# You will store your answers in a *named list*, and at the end of the section,
# Convert that list to a data frame, and write the data frame to a .csv file.
# Pay close attention to the *names* to be used in the list.


# Create a dataframe with only the county level observations from the
# `binge_driking.csv` dataset. You should (again) think of Washington D.C. as
# a state, and therefore *exclude it here*.
# However, you should include "county-like" areas such as parishes and boroughs
county_data <- binge.drinking.csv %>% distinct(state)

# Create an empty list in which to store answers to the questions below.


# What is the average county level of binge drinking in 2012 for both sexes?
# Store the number in your list as `avg_both_sexes`.


# What is the name of the county with the largest increase in male binge
# drinking between 2002 and 2012?
# Store the county name in your list as `largest_male_increase`.


# How many counties experienced an increase in male binge drinking between
# 2002 and 2012?
# Store the number in your list as `num_male_increase`.


# What fraction of counties experienced an increase in male binge drinking
# between 2002 and 2012?
# Store the fraction (num/total) in your list as `frac_male_increase`.

  
# How many counties experienced an increase in female binge drinking between
# 2002 and 2012?
# Store the number in your list as `num_female_increase`.


# What fraction of counties experienced an increase in female binge drinking
# between 2002 and 2012?
# Store the fraction (num/total) in your list as `frac_female_increase`.


# How many counties experienced a rise in female binge drinking *and*
# a decline in male binge drinking?
# Store the number in your list as `num_f_increase_m_decrease`.

# Convert your list to a data frame, and write the results
# to the file `binge_info.csv`

# The next questions return *data frames as results*:

# What is the *minimum* level of binge drinking in each state in 2012 for
# both sexes (across the counties)? Your answer should contain roughly 50 values
# (one for each state), unless there are two counties in a state with the
# same value. Your answer should be a *dataframe* with the location, state, and
# 2012 binge drinking rate. Write this to a file called `min_binge.csv`.


# What is the *maximum* level of binge drinking in each state in 2012 for
# both sexes (across the counties)? Your answer should contain roughly 50 values
# (one for each state), unless there are two counties in a state with the
# same value. Your answer should be a *dataframe* with the location, state, and
# 2012 binge drinking rate. Write this to a file called `max_binge.csv`.

  
################################# Joining Data #################################
# You'll often have to join different datasets together in order to ask more
# involved questions of your dataset. In order to join our datasets together,
# you'll have to rename their columns to differentiate them.


# First, rename all prevalence columns in the any_drinking dataset to the
# have prefix "any_" (i.e., `males_2002` should now be `any_males_2002`)
# Hint: you can get (and set!) column names using the colnames function.
# This may take multiple lines of code.


# Then, rename all prevalence columns in the binge_drinking dataset to the have
# the prefix "binge_" (i.e., `males_2002` should now be `binge_males_2002`)
# This may take multiple lines of code.


# Then, create a dataframe by joining together the both datasets.
# Think carefully about the *type* of join you want to do, and what the
# *identifying columns* are. You will use this (joined) data to answer the
# questions below.


# Create a column `diff_2012` storing the difference between `any` and `binge`
# drinking for both sexes in 2012


# Which location has the greatest *absolute* difference between `any` and
# `binge` drinking? Your answer should be a one row data frame with the state,
# location, and column of interest (diff_2012).
# Write this dataframe to `biggest_abs_diff_2012.csv`.


# Which location has the smallest *absolute* difference between `any` and
# `binge` drinking? Your answer should be a one row data frame with the state,
# location, and column of interest (diff_2012).
# Write this dataframe to `smallest_abs_diff_2012.csv`.

############## Write a function to ask your own question(s) ####################
# Even in an entry level data analyst role, people are expected to come up with
# their own questions of interest (not just answer the questions that other
# people have). For this section, you should *write a function* that allows you
# to ask the same question on different subsets of data. For example, you may
# want to ask about the highest/lowest drinking level given a state or year.
# The purpose of your function should be evident given the input parameters and
# function name. After writing your function, *demonstrate* that the function
# works by passing in different parameters to your function.


################################### Challenge ##################################

# Using your function from part 1 that wrote a .csv file given a state name,
# write a separate file for each of the 51 states (including Washington D.C.)
# The challenge is to do this in a *single line of (very concise) code*


# Write a function that allows you to pass in a *dataframe* (i.e., in the format
# of binge_drinking or any_drinking) *year*, and *state* of interest. The
# function should saves a .csv file with observations from that state's counties
# (and the state itself). It should only write the columns `state`, `location`,
# and data from the specified year. Before writing the .csv file, you should
# *sort* the data.frame in descending order by the both_sexes drinking rate in
# the specified year. The file name should have the format:
# `DRINKING_STATE_YEAR.csv` (i.e. `any_Utah_2005.csv`).
# To write this function, you will either have to use a combination of dplyr
# and base R, or confront how dplyr uses *non-standard evaluation*
# Hint: https://github.com/tidyverse/dplyr/blob/34423af89703b0772d59edcd0f3485295b629ab0/vignettes/nse.Rmd
# Hint: https://www.r-bloggers.com/non-standard-evaluation-and-standard-evaluation-in-dplyr/


# Create the file `binge_Colorado_2007.csv` using your function.

Solutions

Expert Solution

# a4-data-wrangling

# Before you get started, set your working directory using the Session menu.
# While we (mostly) don't require specific variable names, we will be checking
# your code (structure + style) as well as your output. The .csv files you save 
# must have the described format/column names, and the file name provided. 
# For all .csv file, make sure to exclude rownames, and write them to the 
# a folder called `output/` which you will create below.

################################### Set up ###################################

# Install (if not installed) + load dplyr package 
library("dplyr")

# Read in `any_drinking.csv` data using a *relative path*
any_drinking <- read.csv("./data/any_drinking.csv")

# Read in `binge.drinking.csv` data using a *relative path*
binge_drinking <- read.csv("./data/binge_drinking.csv")

# Create a directory (using R) called "output" in your project directory
# Make sure to *suppress any warnings*, in case the directory already exists
# You must save all .csv files in this directory (last reminder!)
dir.create("./output", showWarnings = FALSE)

############################# Any drinking in 2012 #############################

# For this first section, you will work only with the *any drinking* dataset.
# In particular, we'll focus on data from 2012. All output should include only 
# the relevant 2012 columns (as well as `state` + `location`), described below. 


# Create a new data.frame that has the `state` and `location` columns, 
# and all columns with data from 2012. you will use this dataframe throughout 
# the rest of this section.
state_location <- select(
                    any_drinking, location, state,
                    both_sexes_2012, males_2012,
                    females_2012
                  )

# Using the (new) 2012 data, create a column `diff` that has 
# the difference in male and female drinking rates
state_location <- mutate(state_location, diff = males_2012 - females_2012)

# Write your data to a file `diff_2012.csv` (in your `output/` directory)
# Make sure to exclude rownames (for all .csv files! -- last reminder).
write.csv(state_location, file = "./output/diff_2012.csv", row.names = FALSE)

# To answer "Are there any locations where females drink more than males"?
# Create a new dataframe by filtering the 2012 dataframe to the rows that 
# meet the criterion. Keep only the `state`, `location`, and column of interest. 
# Write your answer to `more_m_than_f.csv`.
# The answer is no, data frame is empty
# By 2012 data frame do they mean the change one?
more_m_than_f <- select(state_location, state, location, diff) %>%
  filter(diff < 0)

write.csv(more_m_than_f, file = "./output/more_m_than_f.csv", row.names = FALSE)

# To answer the question: "What is the location in which male and female 
# drinking rates are most similar", create a new dataframe by filtering the 2012 
# dataframe to the rows that meet the criterion. Keep only the `state`, 
# `location`, and column of interest. Write your answer to `most_similar.csv`.
# What do they mean by rows
most_similar <- select(state_location, location, state, diff) %>%
  filter(diff == min(abs(diff)))
write.csv(most_similar, file = "./output/most_similar.csv", row.names = FALSE)

# As you've (hopefully) noticed, the `location` column includes national, 
# state, and county level estimates. However, many audiences may only be 
# interested in the *state* level data. Given that, you should do the following:
# Create a new data frame that is only the *state level* observations in 2012.
# For the sake of this analysis, you should treat Washington D.C. as a *state*
# Write this data frame to `state_only.csv`.
# I noticed that counties always match the location column
state_only <- filter(
                state_location, as.character(location) == as.character(state)
              )
write.csv(state_only, file = "./output/state_only.csv", row.names = FALSE)

# Which state had the **highest** drinking rate for both sexes combined? 
# Your answer should be a *dataframe* of the state and value of interest
# Write this data frame to `highest_state.csv`.
highest_state <- select(state_only, state, both_sexes_2012) %>%
  filter(both_sexes_2012 == max(both_sexes_2012))
write.csv(highest_state, file = "./output/highest_state.csv", row.names = FALSE)

# Which state had the **lowest** drinking rate for both sexes combined?
# Your answer should be a *dataframe* of the state and value of interest
# Write this data frame to `lowest_state.csv`.
lowest_state <- select(state_only, state, both_sexes_2012) %>%
  filter(both_sexes_2012 == min(both_sexes_2012))
write.csv(lowest_state, file = "./output/lowest_state.csv", row.names = FALSE)

# What was the difference in prevalence between the state with the highest level
# of consumption,and the state with the lowest level of consumption?
# Your answer should be a single value (a dataframe storing one value is fine)
# Store your answer in a variable called `biggest_state_diff`.
biggest_state_diff <- highest_state$both_sexes_2012 -
                        lowest_state$both_sexes_2012


# Write a function called `get_state_data` that allows you to specify a state, 
# then saves a .csv file (`STATE_data.csv`) with observations from that state 
# This includes data about the state, as well as the counties in the state
# You should use the full any.drinking dataset in this function (not just 2012)
get_state_data <- function(state_search) {
  state_name <- filter(any_drinking, state == state_search)
  write.csv(state_name, file = paste0(
                                 "./output/", state_search,
                                 ".csv"), row.names = FALSE
                               )
}

# Demonstrate that you function works by passing "Utah" to the function
get_state_data("Utah")

############################ Binge drinking Dataset ############################

# In this section, you will ask a variety of questions regarding the 
# `binge_drinking.csv` dataset. More specifically, you will analyze a subset of 
# the observations of *just the counties* (exclude state/national estimates!).
# You will store your answers in a *named list*, and at the end of the section, 
# Convert that list to a data frame, and write the data frame to a .csv file.
# Pay close attention to the *names* to be used in the list.


# Create a dataframe with only the county level observations from the 
# `binge_driking.csv` dataset. You should (again) think of Washington D.C. as 
# a state, and therefore *exclude it here*.
# However, you should include "county-like" areas such as parishes and boroughs
county_only <- filter(
                 binge_drinking,
                 as.character(location) != as.character(state)
               ) %>%
                 filter(location != "United States")

# Create an empty list in which to store answers to the questions below.
answer_list <- list()

# What is the average county level of binge drinking in 2012 for both sexes?
# Store the number in your list as `avg_both_sexes`.
answer_list$avg_both_sexes <- mean(county_only$both_sexes_2012)

# What is the name of the county with the largest increase in male binge 
# drinking between 2002 and 2012?
# Store the county name in your list as `largest_male_increase`.
largest_male_increase <- filter(
                           county_only,
                           males_2012 - males_2002 == 
                             max(males_2012 - males_2002)
                           ) %>%
                             select(location)
answer_list$largest_male_increase <- largest_male_increase[1]

# How many counties experienced an increase in male binge drinking between
# 2002 and 2012?
# Store the number in your list as `num_male_increase`.
counties_male_increase <- filter(county_only, (males_2012 - males_2002) > 0)
answer_list$num_male_increase <- nrow(counties_increase)

# What fraction of counties experienced an increase in male binge drinking 
# between 2002 and 2012?
# Store the fraction (num/total) in your list as `frac_male_increase`.
# I could have just used the number of rows in county only but, figured this solution was more robust
answer_list$frac_male_increase <- nrow(counties_increase) / nrow(county_only)
                    
# How many counties experienced an increase in female binge drinking between
# 2002 and 2012?
# Store the number in your list as `num_female_increase`.
counties_female_increase <- filter(
                              county_only,
                              (females_2012 - females_2002) > 0
                            )
answer_list$num_female_increase <- nrow(counties_female_increase)

# What fraction of counties experienced an increase in female binge drinking 
# between 2002 and 2012?
# Store the fraction (num/total) in your list as `frac_female_increase`.
answer_list$frac_female_increase <- (nrow(counties_female_increase)
                                       / nrow(county_only))

# How many counties experienced a rise in female binge drinking *and* 
# a decline in male binge drinking?
# Store the number in your list as `num_f_increase_m_decrease`.
answer_list$num_f_increase_m_decrease <- nrow(
                                           filter(counties_female_increase,
                                           (males_2012 - males_2002) < 0)
                                         )

# Convert your list to a data frame, and write the results 
# to the file `binge_info.csv`
write.csv(
  data.frame(answer_list),
  file = "./output/binge_info.csv",
  row.names = FALSE
)

# The next questions return *data frames as results*:

# What is the *minimum* level of binge drinking in each state in 2012 for 
# both sexes (across the counties)? Your answer should contain roughly 50 values
# (one for each state), unless there are two counties in a state with the 
# same value. Your answer should be a *dataframe* with the location, state, and 
# 2012 binge drinking rate. Write this to a file called `min_binge.csv`.


min_binge <- group_by(county_only, state) %>%
              filter(both_sexes_2012 == min(both_sexes_2012)) %>%
              select(state, location, both_sexes_2012)

write.csv(
  data.frame(min_binge),
  file = "./output/min_binge.csv",
  row.names = FALSE
)

# What is the *maximum* level of binge drinking in each state in 2012 for 
# both sexes (across the counties)? Your answer should contain roughly 50 values
# (one for each state), unless there are two counties in a state with the 
# same value. Your answer should be a *dataframe* with the location, state, and 
# 2012 binge drinking rate. Write this to a file called `max_binge.csv`.

max_binge <- group_by(county_only, state) %>%
  filter(both_sexes_2012 == max(both_sexes_2012)) %>%
  select(state, location, both_sexes_2012)

write.csv(
  data.frame(max_binge),
  file = "./output/max_binge.csv",
  row.names = FALSE
)
################################# Joining Data #################################
# You'll often have to join different datasets together in order to ask more 
# involved questions of your dataset. In order to join our datasets together, 
# you'll have to rename their columns to differentiate them. 


# First, rename all prevalence columns in the any_drinking dataset to the 
# have prefix "any_" (i.e., `males_2002` should now be `any_males_2002`)
# Hint: you can get (and set!) column names using the colnames function. 
# This may take multiple lines of code.
colnames(any_drinking)[3:35] <- paste0("any_", colnames(any_drinking)[3:35])

# Then, rename all prevalence columns in the binge_drinking dataset to the have
# the prefix "binge_" (i.e., `males_2002` should now be `binge_males_2002`)
# This may take multiple lines of code.
colnames(binge_drinking)[3:35] <- paste0(
                                    "binge_",
                                    colnames(binge_drinking)[3:35])

# Then, create a dataframe by joining together the both datasets. 
# Think carefully about the *type* of join you want to do, and what the 
# *identifying columns* are. You will use this (joined) data to answer the 
# questions below.
joined_drinking <- left_join(binge_drinking, any_drinking)

# Create a column `diff_2012` storing the difference between `any` and `binge` 
# drinking for both sexes in 2012
joined_drinking <- mutate(
                     joined_drinking,
                     diff_2012 = (any_both_sexes_2012 - binge_both_sexes_2012)
                   )

# Which location has the greatest *absolute* difference between `any` and 
# `binge` drinking? Your answer should be a one row data frame with the state, 
# location, and column of interest (diff_2012). 
# Write this dataframe to `biggest_abs_diff_2012.csv`.
biggest_abs_diff_2012 <- filter(
                           joined_drinking, diff_2012 == max(abs(diff_2012))
                         )
write.csv(
  data.frame(biggest_abs_diff_2012),
  file = "./output/biggest_abs_diff_2012.csv",
  row.names = FALSE
)


# Which location has the smallest *absolute* difference between `any` and 
# `binge` drinking? Your answer should be a one row data frame with the state, 
# location, and column of interest (diff_2012). 
# Write this dataframe to `smallest_abs_diff_2012.csv`.
smallest_abs_diff_2012 <- filter(
                            joined_drinking,
                            diff_2012 == min(abs(diff_2012))
                          )
write.csv(
  data.frame(smallest_abs_diff_2012),
  file = "./output/smallest_abs_diff_2012.csv",
  row.names = FALSE
)

############## Write a function to ask your own question(s) ####################
# Even in an entry level data analyst role, people are expected to come up with 
# their own questions of interest (not just answer the questions that other 
# people have). For this section, you should *write a function* that allows you 
# to ask the same question on different subsets of data. For example, you may 
# want to ask about the highest/lowest drinking level given a state or year. 
# The purpose of your function should be evident given the input parameters and 
# function name. After writing your function, *demonstrate* that the function 
# works by passing in different parameters to your function.
# My question is "What is the location with the highest drinking level given a particular year?
highest_drinking_male <- function (year) {
  index <- 3 * (year - 2002) + 5
  highest_drinking <- filter(
                        any_drinking,
                        any_drinking[index] == max(any_drinking[index])
                      )
  highest_drinking[1, 2]
}
highest_drinking_male(2012)
################################### Challenge ##################################

# Using your function from part 1 that wrote a .csv file given a state name, 
# write a separate file for each of the 51 states (including Washington D.C.)
# The challenge is to do this in a *single line of (very concise) code*
lapply(any_drinking$state, get_state_data)

# Write a function that allows you to pass in a *dataframe* (i.e., in the format 
# of binge_drinking or any_drinking) *year*, and *state* of interest. The 
# function should saves a .csv file with observations from that state's counties
# (and the state itself). It should only write the columns `state`, `location`, 
# and data from the specified year. Before writing the .csv file, you should 
# *sort* the data.frame in descending order by the both_sexes drinking rate in 
# the specified year. The file name should have the format:
# `DRINKING_STATE_YEAR.csv` (i.e. `any_Utah_2005.csv`).
# To write this function, you will either have to use a combination of dplyr 
# and base R, or confront how dplyr uses *non-standard evaluation*
# Hint: https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
csv_state_year <- function(dataframe, year, state_name) {
  drinking_type <- gsub("_.*", "", colnames(dataframe)[3])
  dataframe_sorted <- select_(
                        dataframe, "state", "location",
                        paste0(drinking_type, "_males_", year),
                        paste0(drinking_type, "_females_", year),
                        paste0(drinking_type, "_both_sexes_", year)
                      ) %>%
                        arrange_(paste0(
                                   "-", drinking_type,
                                   "_both_sexes_", year)
                                 ) %>%
                                  filter(state == state_name)
  write.csv(
    dataframe_sorted,
    file = paste0(
             "./output/", drinking_type,
             "_", state_name, "_", year, ".csv"
           )
  )
}

# Create the file `binge_Colorado_2007.csv` using your function.
csv_state_year(binge_drinking, "2007", "Colorado")

Related Solutions

Write the program WritePatientRecords that allows a doctor’s staff to enter data about patients and saves...
Write the program WritePatientRecords that allows a doctor’s staff to enter data about patients and saves the data to a file called Patients.txt. The output should be in the following format: p#, PATIENT_NAME, BALANCE. Create a Patient class that contains fields for ID number, name, and current balance owed to the doctor’s office. using System; using static System.Console; using System.IO; class WritePatientRecords { static void Main() { // Your code here } }
Write a Java program that allows the user to specify a file name on the command...
Write a Java program that allows the user to specify a file name on the command line and prints the number of characters, words, lines, average number of words per line, and average number of characters per word in that file. If the user does not specify any file name, then prompt the user for the name.
Create a C# application for Library that allows you to enter data for books and saves...
Create a C# application for Library that allows you to enter data for books and saves the data to a file named Books.txt. Create a Book class that contains fields for title, author, number of pages, and price of the book and ToString() method. The fields of records in the file are separated with asterisk (*). Expecting sentinel value for ending the process of writing to file is "000"
Write a class called CompoundManager that allows you to save all compounds in memory to a...
Write a class called CompoundManager that allows you to save all compounds in memory to a text or binary file and then read a list of compounds from a text or binary file and append these to those in memory, those that don’t already exist. In other words, if Ammonia is in memory and the system reads a file that contains Ammonia, then it is ignored, and it moves on to the next compound in the file. Please write the...
In a file called NumbersToMonths.java, write a program that: Asks the user to specify a month...
In a file called NumbersToMonths.java, write a program that: Asks the user to specify a month as an integer between 1 and 12. It is OK for your program to crash when the user does not enter a valid integer. Prints out the name of the corresponding month. Prints out "This number does not correspond to a month." if the integer is less than 1 or greater than 12. For example: if the user enters 1, your program output should...
3. For each case, write the degrees of freedom (F) and specify the state of the...
3. For each case, write the degrees of freedom (F) and specify the state of the water. Calculate quality or degrees of superheat whenever applicable. a) What is the quality of wet steam if V= 8.35 cm3 /g at 10,703 kPa? What is the specific internal Energy? What is the specific entropy? b) What is the specific enthalpy of steam at T= 200°C and P=525 kPa? c) What is the specific entropy of steam at 9100 kPa and 560° C?...
For some reason another programmer has asked you to write a function. The function is called...
For some reason another programmer has asked you to write a function. The function is called getWarYear ( ) It has one parameter , an integer. if it is 1 return a random number in the range of 1914 to 1919. if it is 2 return a random number in the range of 1938 to 1945. Since the function is returning an year the data type of the return value is an integer.
Write the code (in a function) that allows the user to display the first “nb” of...
Write the code (in a function) that allows the user to display the first “nb” of elements in the Fibonacci sequence (each value on a separate line in the console). The order of the Fibonacci sequence goes as follows: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144 and on to infinity. Each number is the sum of the previous two (with the 2 starting numbers being 1 and 1). This series of numbers is known as...
Python 2. Please ensure you run the tests in idle Write a function orderPizza that allows...
Python 2. Please ensure you run the tests in idle Write a function orderPizza that allows the user input to build a pizza. It then prints a thank you message, the cost of the pizza and then returns the Pizza that was built. >>> orderPizza() Welcome to Python Pizza! What size pizza would you like (S,M,L): M Type topping to add (or Enter to quit): mushroom Type topping to add (or Enter to quit): onion Type topping to add (or...
Write, specify and prove the function reverse that reverses an array in place. Take care of...
Write, specify and prove the function reverse that reverses an array in place. Take care of the unmodified part of the array at some iteration of the loop. Assume that the swap function is already proved. Note: Prototype is as below. [7 M] [CO2] void swap(int* a, int* b); void reverse(int* array, size_t len){ }
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT