In: Computer Science
# Write a function called `get_state_data` that allows you to
specify a state,
# then saves a .csv file (`STATE_data.csv`) with observations from
that state
# This includes data about the state, as well as the counties in
the state
# You should use the full any.drinking dataset in this function
(not just 2012)
# Demonstrate that you function works by passing "Utah" to the
function
state_Utah <- get_state_data(Utah)
############################ Binge drinking Dataset ############################
# In this section, you will ask a variety of questions regarding
the
# `binge_drinking.csv` dataset. More specifically, you will analyze
a subset of
# the observations of *just the counties* (exclude state/national
estimates!).
# You will store your answers in a *named list*, and at the end of
the section,
# Convert that list to a data frame, and write the data frame to a
.csv file.
# Pay close attention to the *names* to be used in the list.
# Create a dataframe with only the county level observations from
the
# `binge_driking.csv` dataset. You should (again) think of
Washington D.C. as
# a state, and therefore *exclude it here*.
# However, you should include "county-like" areas such as parishes
and boroughs
county_data <- binge.drinking.csv %>% distinct(state)
# Create an empty list in which to store answers to the questions below.
# What is the average county level of binge drinking in 2012 for
both sexes?
# Store the number in your list as `avg_both_sexes`.
# What is the name of the county with the largest increase in male
binge
# drinking between 2002 and 2012?
# Store the county name in your list as
`largest_male_increase`.
# How many counties experienced an increase in male binge drinking
between
# 2002 and 2012?
# Store the number in your list as `num_male_increase`.
# What fraction of counties experienced an increase in male binge
drinking
# between 2002 and 2012?
# Store the fraction (num/total) in your list as
`frac_male_increase`.
# How many counties experienced an increase in female binge
drinking between
# 2002 and 2012?
# Store the number in your list as `num_female_increase`.
# What fraction of counties experienced an increase in female binge
drinking
# between 2002 and 2012?
# Store the fraction (num/total) in your list as
`frac_female_increase`.
# How many counties experienced a rise in female binge drinking
*and*
# a decline in male binge drinking?
# Store the number in your list as `num_f_increase_m_decrease`.
# Convert your list to a data frame, and write the results
# to the file `binge_info.csv`
# The next questions return *data frames as results*:
# What is the *minimum* level of binge drinking in each state in
2012 for
# both sexes (across the counties)? Your answer should contain
roughly 50 values
# (one for each state), unless there are two counties in a state
with the
# same value. Your answer should be a *dataframe* with the
location, state, and
# 2012 binge drinking rate. Write this to a file called
`min_binge.csv`.
# What is the *maximum* level of binge drinking in each state in
2012 for
# both sexes (across the counties)? Your answer should contain
roughly 50 values
# (one for each state), unless there are two counties in a state
with the
# same value. Your answer should be a *dataframe* with the
location, state, and
# 2012 binge drinking rate. Write this to a file called
`max_binge.csv`.
################################# Joining Data
#################################
# You'll often have to join different datasets together in order to
ask more
# involved questions of your dataset. In order to join our datasets
together,
# you'll have to rename their columns to differentiate them.
# First, rename all prevalence columns in the any_drinking dataset
to the
# have prefix "any_" (i.e., `males_2002` should now be
`any_males_2002`)
# Hint: you can get (and set!) column names using the colnames
function.
# This may take multiple lines of code.
# Then, rename all prevalence columns in the binge_drinking dataset
to the have
# the prefix "binge_" (i.e., `males_2002` should now be
`binge_males_2002`)
# This may take multiple lines of code.
# Then, create a dataframe by joining together the both
datasets.
# Think carefully about the *type* of join you want to do, and what
the
# *identifying columns* are. You will use this (joined) data to
answer the
# questions below.
# Create a column `diff_2012` storing the difference between `any`
and `binge`
# drinking for both sexes in 2012
# Which location has the greatest *absolute* difference between
`any` and
# `binge` drinking? Your answer should be a one row data frame with
the state,
# location, and column of interest (diff_2012).
# Write this dataframe to `biggest_abs_diff_2012.csv`.
# Which location has the smallest *absolute* difference between
`any` and
# `binge` drinking? Your answer should be a one row data frame with
the state,
# location, and column of interest (diff_2012).
# Write this dataframe to `smallest_abs_diff_2012.csv`.
############## Write a function to ask your own question(s)
####################
# Even in an entry level data analyst role, people are expected to
come up with
# their own questions of interest (not just answer the questions
that other
# people have). For this section, you should *write a function*
that allows you
# to ask the same question on different subsets of data. For
example, you may
# want to ask about the highest/lowest drinking level given a state
or year.
# The purpose of your function should be evident given the input
parameters and
# function name. After writing your function, *demonstrate* that
the function
# works by passing in different parameters to your function.
################################### Challenge
##################################
# Using your function from part 1 that wrote a .csv file given a
state name,
# write a separate file for each of the 51 states (including
Washington D.C.)
# The challenge is to do this in a *single line of (very concise)
code*
# Write a function that allows you to pass in a *dataframe* (i.e.,
in the format
# of binge_drinking or any_drinking) *year*, and *state* of
interest. The
# function should saves a .csv file with observations from that
state's counties
# (and the state itself). It should only write the columns `state`,
`location`,
# and data from the specified year. Before writing the .csv file,
you should
# *sort* the data.frame in descending order by the both_sexes
drinking rate in
# the specified year. The file name should have the format:
# `DRINKING_STATE_YEAR.csv` (i.e. `any_Utah_2005.csv`).
# To write this function, you will either have to use a combination
of dplyr
# and base R, or confront how dplyr uses *non-standard
evaluation*
# Hint:
https://github.com/tidyverse/dplyr/blob/34423af89703b0772d59edcd0f3485295b629ab0/vignettes/nse.Rmd
# Hint:
https://www.r-bloggers.com/non-standard-evaluation-and-standard-evaluation-in-dplyr/
# Create the file `binge_Colorado_2007.csv` using your
function.
# a4-data-wrangling # Before you get started, set your working directory using the Session menu. # While we (mostly) don't require specific variable names, we will be checking # your code (structure + style) as well as your output. The .csv files you save # must have the described format/column names, and the file name provided. # For all .csv file, make sure to exclude rownames, and write them to the # a folder called `output/` which you will create below. ################################### Set up ################################### # Install (if not installed) + load dplyr package library("dplyr") # Read in `any_drinking.csv` data using a *relative path* any_drinking <- read.csv("./data/any_drinking.csv") # Read in `binge.drinking.csv` data using a *relative path* binge_drinking <- read.csv("./data/binge_drinking.csv") # Create a directory (using R) called "output" in your project directory # Make sure to *suppress any warnings*, in case the directory already exists # You must save all .csv files in this directory (last reminder!) dir.create("./output", showWarnings = FALSE) ############################# Any drinking in 2012 ############################# # For this first section, you will work only with the *any drinking* dataset. # In particular, we'll focus on data from 2012. All output should include only # the relevant 2012 columns (as well as `state` + `location`), described below. # Create a new data.frame that has the `state` and `location` columns, # and all columns with data from 2012. you will use this dataframe throughout # the rest of this section. state_location <- select( any_drinking, location, state, both_sexes_2012, males_2012, females_2012 ) # Using the (new) 2012 data, create a column `diff` that has # the difference in male and female drinking rates state_location <- mutate(state_location, diff = males_2012 - females_2012) # Write your data to a file `diff_2012.csv` (in your `output/` directory) # Make sure to exclude rownames (for all .csv files! -- last reminder). write.csv(state_location, file = "./output/diff_2012.csv", row.names = FALSE) # To answer "Are there any locations where females drink more than males"? # Create a new dataframe by filtering the 2012 dataframe to the rows that # meet the criterion. Keep only the `state`, `location`, and column of interest. # Write your answer to `more_m_than_f.csv`. # The answer is no, data frame is empty # By 2012 data frame do they mean the change one? more_m_than_f <- select(state_location, state, location, diff) %>% filter(diff < 0) write.csv(more_m_than_f, file = "./output/more_m_than_f.csv", row.names = FALSE) # To answer the question: "What is the location in which male and female # drinking rates are most similar", create a new dataframe by filtering the 2012 # dataframe to the rows that meet the criterion. Keep only the `state`, # `location`, and column of interest. Write your answer to `most_similar.csv`. # What do they mean by rows most_similar <- select(state_location, location, state, diff) %>% filter(diff == min(abs(diff))) write.csv(most_similar, file = "./output/most_similar.csv", row.names = FALSE) # As you've (hopefully) noticed, the `location` column includes national, # state, and county level estimates. However, many audiences may only be # interested in the *state* level data. Given that, you should do the following: # Create a new data frame that is only the *state level* observations in 2012. # For the sake of this analysis, you should treat Washington D.C. as a *state* # Write this data frame to `state_only.csv`. # I noticed that counties always match the location column state_only <- filter( state_location, as.character(location) == as.character(state) ) write.csv(state_only, file = "./output/state_only.csv", row.names = FALSE) # Which state had the **highest** drinking rate for both sexes combined? # Your answer should be a *dataframe* of the state and value of interest # Write this data frame to `highest_state.csv`. highest_state <- select(state_only, state, both_sexes_2012) %>% filter(both_sexes_2012 == max(both_sexes_2012)) write.csv(highest_state, file = "./output/highest_state.csv", row.names = FALSE) # Which state had the **lowest** drinking rate for both sexes combined? # Your answer should be a *dataframe* of the state and value of interest # Write this data frame to `lowest_state.csv`. lowest_state <- select(state_only, state, both_sexes_2012) %>% filter(both_sexes_2012 == min(both_sexes_2012)) write.csv(lowest_state, file = "./output/lowest_state.csv", row.names = FALSE) # What was the difference in prevalence between the state with the highest level # of consumption,and the state with the lowest level of consumption? # Your answer should be a single value (a dataframe storing one value is fine) # Store your answer in a variable called `biggest_state_diff`. biggest_state_diff <- highest_state$both_sexes_2012 - lowest_state$both_sexes_2012 # Write a function called `get_state_data` that allows you to specify a state, # then saves a .csv file (`STATE_data.csv`) with observations from that state # This includes data about the state, as well as the counties in the state # You should use the full any.drinking dataset in this function (not just 2012) get_state_data <- function(state_search) { state_name <- filter(any_drinking, state == state_search) write.csv(state_name, file = paste0( "./output/", state_search, ".csv"), row.names = FALSE ) } # Demonstrate that you function works by passing "Utah" to the function get_state_data("Utah") ############################ Binge drinking Dataset ############################ # In this section, you will ask a variety of questions regarding the # `binge_drinking.csv` dataset. More specifically, you will analyze a subset of # the observations of *just the counties* (exclude state/national estimates!). # You will store your answers in a *named list*, and at the end of the section, # Convert that list to a data frame, and write the data frame to a .csv file. # Pay close attention to the *names* to be used in the list. # Create a dataframe with only the county level observations from the # `binge_driking.csv` dataset. You should (again) think of Washington D.C. as # a state, and therefore *exclude it here*. # However, you should include "county-like" areas such as parishes and boroughs county_only <- filter( binge_drinking, as.character(location) != as.character(state) ) %>% filter(location != "United States") # Create an empty list in which to store answers to the questions below. answer_list <- list() # What is the average county level of binge drinking in 2012 for both sexes? # Store the number in your list as `avg_both_sexes`. answer_list$avg_both_sexes <- mean(county_only$both_sexes_2012) # What is the name of the county with the largest increase in male binge # drinking between 2002 and 2012? # Store the county name in your list as `largest_male_increase`. largest_male_increase <- filter( county_only, males_2012 - males_2002 == max(males_2012 - males_2002) ) %>% select(location) answer_list$largest_male_increase <- largest_male_increase[1] # How many counties experienced an increase in male binge drinking between # 2002 and 2012? # Store the number in your list as `num_male_increase`. counties_male_increase <- filter(county_only, (males_2012 - males_2002) > 0) answer_list$num_male_increase <- nrow(counties_increase) # What fraction of counties experienced an increase in male binge drinking # between 2002 and 2012? # Store the fraction (num/total) in your list as `frac_male_increase`. # I could have just used the number of rows in county only but, figured this solution was more robust answer_list$frac_male_increase <- nrow(counties_increase) / nrow(county_only) # How many counties experienced an increase in female binge drinking between # 2002 and 2012? # Store the number in your list as `num_female_increase`. counties_female_increase <- filter( county_only, (females_2012 - females_2002) > 0 ) answer_list$num_female_increase <- nrow(counties_female_increase) # What fraction of counties experienced an increase in female binge drinking # between 2002 and 2012? # Store the fraction (num/total) in your list as `frac_female_increase`. answer_list$frac_female_increase <- (nrow(counties_female_increase) / nrow(county_only)) # How many counties experienced a rise in female binge drinking *and* # a decline in male binge drinking? # Store the number in your list as `num_f_increase_m_decrease`. answer_list$num_f_increase_m_decrease <- nrow( filter(counties_female_increase, (males_2012 - males_2002) < 0) ) # Convert your list to a data frame, and write the results # to the file `binge_info.csv` write.csv( data.frame(answer_list), file = "./output/binge_info.csv", row.names = FALSE ) # The next questions return *data frames as results*: # What is the *minimum* level of binge drinking in each state in 2012 for # both sexes (across the counties)? Your answer should contain roughly 50 values # (one for each state), unless there are two counties in a state with the # same value. Your answer should be a *dataframe* with the location, state, and # 2012 binge drinking rate. Write this to a file called `min_binge.csv`. min_binge <- group_by(county_only, state) %>% filter(both_sexes_2012 == min(both_sexes_2012)) %>% select(state, location, both_sexes_2012) write.csv( data.frame(min_binge), file = "./output/min_binge.csv", row.names = FALSE ) # What is the *maximum* level of binge drinking in each state in 2012 for # both sexes (across the counties)? Your answer should contain roughly 50 values # (one for each state), unless there are two counties in a state with the # same value. Your answer should be a *dataframe* with the location, state, and # 2012 binge drinking rate. Write this to a file called `max_binge.csv`. max_binge <- group_by(county_only, state) %>% filter(both_sexes_2012 == max(both_sexes_2012)) %>% select(state, location, both_sexes_2012) write.csv( data.frame(max_binge), file = "./output/max_binge.csv", row.names = FALSE ) ################################# Joining Data ################################# # You'll often have to join different datasets together in order to ask more # involved questions of your dataset. In order to join our datasets together, # you'll have to rename their columns to differentiate them. # First, rename all prevalence columns in the any_drinking dataset to the # have prefix "any_" (i.e., `males_2002` should now be `any_males_2002`) # Hint: you can get (and set!) column names using the colnames function. # This may take multiple lines of code. colnames(any_drinking)[3:35] <- paste0("any_", colnames(any_drinking)[3:35]) # Then, rename all prevalence columns in the binge_drinking dataset to the have # the prefix "binge_" (i.e., `males_2002` should now be `binge_males_2002`) # This may take multiple lines of code. colnames(binge_drinking)[3:35] <- paste0( "binge_", colnames(binge_drinking)[3:35]) # Then, create a dataframe by joining together the both datasets. # Think carefully about the *type* of join you want to do, and what the # *identifying columns* are. You will use this (joined) data to answer the # questions below. joined_drinking <- left_join(binge_drinking, any_drinking) # Create a column `diff_2012` storing the difference between `any` and `binge` # drinking for both sexes in 2012 joined_drinking <- mutate( joined_drinking, diff_2012 = (any_both_sexes_2012 - binge_both_sexes_2012) ) # Which location has the greatest *absolute* difference between `any` and # `binge` drinking? Your answer should be a one row data frame with the state, # location, and column of interest (diff_2012). # Write this dataframe to `biggest_abs_diff_2012.csv`. biggest_abs_diff_2012 <- filter( joined_drinking, diff_2012 == max(abs(diff_2012)) ) write.csv( data.frame(biggest_abs_diff_2012), file = "./output/biggest_abs_diff_2012.csv", row.names = FALSE ) # Which location has the smallest *absolute* difference between `any` and # `binge` drinking? Your answer should be a one row data frame with the state, # location, and column of interest (diff_2012). # Write this dataframe to `smallest_abs_diff_2012.csv`. smallest_abs_diff_2012 <- filter( joined_drinking, diff_2012 == min(abs(diff_2012)) ) write.csv( data.frame(smallest_abs_diff_2012), file = "./output/smallest_abs_diff_2012.csv", row.names = FALSE ) ############## Write a function to ask your own question(s) #################### # Even in an entry level data analyst role, people are expected to come up with # their own questions of interest (not just answer the questions that other # people have). For this section, you should *write a function* that allows you # to ask the same question on different subsets of data. For example, you may # want to ask about the highest/lowest drinking level given a state or year. # The purpose of your function should be evident given the input parameters and # function name. After writing your function, *demonstrate* that the function # works by passing in different parameters to your function. # My question is "What is the location with the highest drinking level given a particular year? highest_drinking_male <- function (year) { index <- 3 * (year - 2002) + 5 highest_drinking <- filter( any_drinking, any_drinking[index] == max(any_drinking[index]) ) highest_drinking[1, 2] } highest_drinking_male(2012) ################################### Challenge ################################## # Using your function from part 1 that wrote a .csv file given a state name, # write a separate file for each of the 51 states (including Washington D.C.) # The challenge is to do this in a *single line of (very concise) code* lapply(any_drinking$state, get_state_data) # Write a function that allows you to pass in a *dataframe* (i.e., in the format # of binge_drinking or any_drinking) *year*, and *state* of interest. The # function should saves a .csv file with observations from that state's counties # (and the state itself). It should only write the columns `state`, `location`, # and data from the specified year. Before writing the .csv file, you should # *sort* the data.frame in descending order by the both_sexes drinking rate in # the specified year. The file name should have the format: # `DRINKING_STATE_YEAR.csv` (i.e. `any_Utah_2005.csv`). # To write this function, you will either have to use a combination of dplyr # and base R, or confront how dplyr uses *non-standard evaluation* # Hint: https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html csv_state_year <- function(dataframe, year, state_name) { drinking_type <- gsub("_.*", "", colnames(dataframe)[3]) dataframe_sorted <- select_( dataframe, "state", "location", paste0(drinking_type, "_males_", year), paste0(drinking_type, "_females_", year), paste0(drinking_type, "_both_sexes_", year) ) %>% arrange_(paste0( "-", drinking_type, "_both_sexes_", year) ) %>% filter(state == state_name) write.csv( dataframe_sorted, file = paste0( "./output/", drinking_type, "_", state_name, "_", year, ".csv" ) ) } # Create the file `binge_Colorado_2007.csv` using your function. csv_state_year(binge_drinking, "2007", "Colorado")