Question

Managing Data (R language)

Reference: Chapter 4, "Managing Data", from the textbook Practical Data Science with R, 1st edition, by Nina Zumel and John Mount
(Publisher: Manning, ISBN 13: 978-1-617291-56-2)

1. Explain how you would handle missing data in categorical and numerical variables.

2. Give a few data transformation techniques and the cases where you would apply them.

3. Briefly explain the log transformation and when it should be used.

Solutions

Expert Solution

Ans1:

There are various ways to handle missing values in categorical variables.

  1. Drop the observations with missing values, if the data set is large and only a small number of records have missing values
  2. Drop the variable, if it is not significant for the analysis
  3. Develop a model to predict the missing values
  4. Treat missing data as just another category
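
As a rough illustration of options 1 and 4 above, here is a minimal R sketch. The data frame `d` and its `employment` column are made-up names for this example, not taken from the textbook.

```r
# Hypothetical example data; the column name is an assumption for illustration.
d <- data.frame(
  employment = c("employed", NA, "self-employed", NA, "employed"),
  stringsAsFactors = FALSE
)

# Option 4: recode NA as its own explicit category.
d$employment_fix <- factor(
  ifelse(is.na(d$employment), "missing", d$employment)
)

# Option 1: drop the rows with missing values instead.
d_complete <- d[!is.na(d$employment), , drop = FALSE]

# Count how many rows fall into each category, including "missing".
table(d$employment_fix)
```

Recoding keeps every row, which matters when the fact that a value is missing may itself carry information; dropping rows is simpler but only reasonable when few records are affected.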

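For missing values in numeric variables, we can use one of the following approaches.

  1. Drop the affected observations
  2. Replace missing values with the overall average of the variable
  3. Replace missing values with the average of similar observations (for example, the average within the same group)
  4. Build a model to predict the missing values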
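
Options 2 and 3 above can be sketched in R as follows. The data frame `d`, its columns `region` and `income`, and the extra indicator column are all assumptions made for this example.

```r
# Hypothetical example data; names and values are assumptions for illustration.
d <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  income = c(60000, NA, 30000, 50000, NA)
)

# Record which values were missing before filling them in.
d$income_was_missing <- is.na(d$income)

# Option 2: replace NA with the overall mean of the observed values.
overall_mean <- mean(d$income, na.rm = TRUE)
d$income_overall <- ifelse(is.na(d$income), overall_mean, d$income)

# Option 3: replace NA with the mean of the same group (here, region).
group_mean <- ave(d$income, d$region,
                  FUN = function(x) mean(x, na.rm = TRUE))
d$income_group <- ifelse(is.na(d$income), group_mean, d$income)
```

Keeping an indicator such as `income_was_missing` lets a later model distinguish observed values from filled-in ones.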

Ans2:

The logarithm and square root transformations are commonly used for positive data, and the multiplicative inverse (reciprocal) transformation can be used for non-zero data.

They are used in the following cases:

1. When a value of interest ranges over several orders of magnitude

2. Transforming to normality

3. Transforming to a uniform distribution or an arbitrary distribution

4. Variance-stabilizing transformations
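
The transformations named above can be tried out in a few lines of R. The vector `x` is invented for this sketch, and the rank-based step is only one simple way to map data onto an approximately uniform scale, not the only one.

```r
# Hypothetical positive values spanning several orders of magnitude.
x <- c(1, 4, 9, 100, 10000)

log_x  <- log10(x)   # compresses values that span orders of magnitude
sqrt_x <- sqrt(x)    # milder compression; often used to stabilize variance of counts
inv_x  <- 1 / x      # reciprocal; requires non-zero values

# Rank-based mapping onto (0, 1]: a simple route to a roughly uniform scale.
u <- rank(x) / length(x)
```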

Ans3:

The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution.

The log transformation can be used to make highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.
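
To see the effect described above, the following R sketch simulates log-normal data and compares skewness before and after taking logs. The variable names, the simulation parameters, and the small `skewness` helper are assumptions made for this example; the helper is a plain moment-based estimate, not a function from a particular package.

```r
set.seed(42)

# Simulated, strictly positive, right-skewed data (log-normal by construction).
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# Moment-based sample skewness (hypothetical helper for this sketch).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

log_income <- log10(income)

skew_raw <- skewness(income)      # strongly positive for log-normal data
skew_log <- skewness(log_income)  # close to zero after the transformation
```

Note that `log10()` is only defined for positive values, so zeros or negative values would need separate handling before a transformation like this is applied.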

