Question

In: Statistics and Probability


Business Statistics

Common Mistakes in Statistical Studies:

Top 6 most common statistical errors made by data scientists

Data scientists are a rare breed of professionals who can tackle the world's thorniest problems. They are valued for combining statistical and computational ingenuity, yet even these data pros are prone to mistakes. Having covered the makings of a data scientist extensively, it is time to turn to the six most common statistical mistakes data scientists make. Many of these errors involve the type of measurement, the variability of the data, and the sample size. Statistics provides answers, but in some cases it confuses as well.

Correlation is not causation

According to Tom Fawcett, a leading data science veteran and co-author of Data Science for Business, an underlying principle in statistics and data science is that correlation is not causation: just because two things appear to be related does not mean that one causes the other. This is perhaps the most common mistake in time series analysis. Fawcett cites the example of a stock market index and an unrelated time series, the number of times Jennifer Lawrence was mentioned in the media. The lines look amusingly similar, and such charts usually carry a statement like "Correlation = 0.86". Recall that a correlation coefficient ranges from +1 (a perfect linear relationship) to -1 (a perfect inverse relationship), with zero meaning no linear relationship at all; 0.86 is a high value, suggesting that the statistical relationship between the two time series is strong. Fawcett adds that when exploring relationships between two time series, what one really wants to know is whether the variations in one series are correlated with the variations in the other.
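To see why a high correlation between two trending series proves little by itself, one can simulate two independent random walks. The Python sketch below uses purely hypothetical data (not Fawcett's actual series): the raw levels correlate strongly simply because both drift upward, while their period-to-period changes, which carry the real information about co-movement, do not.

# A minimal sketch with hypothetical data: two unrelated series that both
# trend upward show a high correlation in raw levels, while the correlation
# of their period-to-period changes is near zero.
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Illustrative stand-ins for a "stock index" and a "media mentions" count:
# each is an independent random walk with drift, with no causal link.
stock_index = 100 + np.cumsum(0.5 + rng.normal(0, 1, n))
media_mentions = 10 + np.cumsum(0.2 + rng.normal(0, 0.5, n))

corr_levels = np.corrcoef(stock_index, media_mentions)[0, 1]
corr_changes = np.corrcoef(np.diff(stock_index), np.diff(media_mentions))[0, 1]

print(f"correlation of raw levels:  {corr_levels:.2f}")   # typically high (~0.9)
print(f"correlation of differences: {corr_changes:.2f}")  # typically near 0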

Biased Data

We have heard of biased algorithms, but there is biased data as well. We are talking about biased sampling, which can lead to measurement errors because the sample is unrepresentative. In many cases, data scientists arrive at results that are close but not accurate because of biased estimators. An estimator is a rule for calculating an estimate of a given quantity from the observed data. Non-random samples, in particular, tend to be biased, and their data cannot be used to represent any population beyond themselves.
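As a concrete illustration of a biased estimator, the sketch below (sample sizes chosen arbitrarily for illustration) compares the variance estimator that divides by n with the one that divides by n - 1: on small samples the former systematically underestimates the true variance.

# A minimal sketch of a biased estimator: dividing by n when estimating
# variance underestimates the true value; dividing by n - 1 (Bessel's
# correction) removes that bias.
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0           # population variance (sigma = 2)
n, trials = 5, 100_000   # small samples make the bias easy to see

biased, unbiased = [], []
for _ in range(trials):
    sample = rng.normal(loc=10, scale=2, size=n)
    biased.append(np.var(sample, ddof=0))    # divide by n
    unbiased.append(np.var(sample, ddof=1))  # divide by n - 1

print(f"true variance:              {true_var:.3f}")
print(f"mean of biased estimates:   {np.mean(biased):.3f}")    # ~3.2, too low
print(f"mean of unbiased estimates: {np.mean(unbiased):.3f}")  # ~4.0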

Regression Error

In basic linear or logistic regression, mistakes arise from not knowing what to check on the regression table. In regression analysis, one identifies a dependent variable that varies with the value of one or more independent variables. The first step is to specify the model by defining the response and the predictor variables, and this is where many data scientists trip up by misspecifying the model. To avoid model misspecification, one must establish whether there is a functional relationship between the variables being considered.
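The toy example below illustrates misspecification: the assumed true relationship is quadratic, but a straight line is fitted anyway. The inflated residual spread, and the U-shaped pattern a residual plot would show, is the warning sign that the functional form is wrong.

# A minimal sketch of model misspecification on toy data: the response is
# generated from a quadratic relationship, but a linear model is fitted.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x**2 + rng.normal(0, 2, x.size)  # true relationship is quadratic

# Misspecified model: y ~ a + b*x
lin_coef = np.polyfit(x, y, deg=1)
lin_resid = y - np.polyval(lin_coef, x)

# Correctly specified model: y ~ a + b*x + c*x**2
quad_coef = np.polyfit(x, y, deg=2)
quad_resid = y - np.polyval(quad_coef, x)

print(f"linear model residual std:    {lin_resid.std():.2f}")   # inflated by the missed curvature
print(f"quadratic model residual std: {quad_resid.std():.2f}")  # close to the noise level (2)
# Plotting lin_resid against x would show a systematic U-shaped pattern.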

Misunderstanding P Value

Long pegged as the ‘gold standard’ of statistical validity, P values are a nebulous concept, and many scientists believe they are not as reliable as researchers often assume. P values are used to determine statistical significance in a hypothesis test. According to the American Statistical Association, a P value does not measure the probability that the studied hypothesis is true, nor the probability that the data were produced by random chance alone. Hence, business and organizational decisions should not be based solely on whether a p-value passes a specific threshold. Many believe that data manipulation and significance chasing can make it impossible to draw the right conclusions from findings.
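The sketch below, using hypothetical A/B-test lift numbers, shows what a p-value actually answers in a one-sample t-test, and why it should be read alongside the effect size rather than as the probability that the hypothesis is true.

# A minimal sketch of what a p-value is (and is not), using a one-sample
# t-test on hypothetical data. All numbers are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical daily conversion-rate lift (%), with a true mean lift of 0.3
sample = rng.normal(loc=0.3, scale=1.0, size=40)

# H0: mean lift == 0.  The p-value is the probability of seeing a test
# statistic at least this extreme *if H0 were true* -- it is NOT the
# probability that H0 is true, nor that the result is due to chance alone.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A decision should weigh the effect size and context, not just p < 0.05.
print(f"estimated mean lift: {sample.mean():.2f} (sd {sample.std(ddof=1):.2f})")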

Inadequate Handling of Outliers and Influential Data Points

Outliers can affect any statistical analysis, so they should be investigated and then deleted, corrected, or explained, as appropriate. For auditable work, the decision on how each outlier was treated should be documented. Sometimes a loss of information is a valid trade-off in return for enhanced comprehension.
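One common, auditable way to flag candidate outliers is the 1.5 × IQR rule. The sketch below uses made-up measurements; flagging a point is only the first step before deciding whether to delete, correct, or explain it.

# A minimal sketch of flagging outliers with the 1.5 * IQR rule.
# The data and threshold are illustrative; the right treatment still
# depends on why the point is extreme.
import numpy as np

values = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 25.0, 12.3, 11.7, 12.1])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print(f"flagged outliers: {outliers}")   # 25.0 stands out here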

Loss of information

The main objective of statistical data analysis is to provide the best business outcome with minimal modeling or human bias. Sometimes a loss of information about individual data points can distort the result and its relationship with the rest of the data set.
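A small illustration of this, with made-up numbers: replacing individual observations by group summaries throws away the within-group variability, which can change what a downstream analysis sees.

# A minimal sketch of information loss through aggregation (illustrative
# numbers): replacing individual responses by group means erases the
# within-group variability that later analysis may depend on.
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical customer spend in two regions
region_a = rng.normal(loc=50, scale=30, size=500)
region_b = rng.normal(loc=55, scale=5, size=500)

raw = np.concatenate([region_a, region_b])
aggregated = np.array([region_a.mean(), region_b.mean()])  # only two summary points

print(f"std of individual spend: {raw.std():.1f}")         # ~22, reflects the real spread
print(f"std of aggregated means: {aggregated.std():.1f}")  # ~2.5, the spread has vanished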

1.) Choose 3 common mistakes, summarize the mistakes and how the material from this course can help you not make or fall for the mistakes when others make them.

Solutions

Expert Solution

Common Mistakes

1. Misunderstandings about Probability

There are many types of misunderstandings about probability; the following are just a sample of the more common or problematic ones.

  • Misunderstandings arising from different perspectives on probability
  • Misunderstandings arising from lack of clarity about the reference category
  • Misunderstandings involving different uses of the word "risk"
  • Misunderstandings involving conditional probabilities (see the sketch below)
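The last point is worth a worked example. The sketch below uses hypothetical round numbers for a screening test to show why P(disease | positive result) is not the same as P(positive result | disease).

# A minimal sketch of the conditional-probability confusion: P(disease | positive)
# is not P(positive | disease). Prevalence and test-accuracy figures below are
# hypothetical round numbers chosen for illustration.
prevalence = 0.01          # P(disease)
sensitivity = 0.95         # P(positive | disease)
false_positive_rate = 0.05 # P(positive | no disease)

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive  # Bayes' rule

print(f"P(positive | disease) = {sensitivity:.2f}")
print(f"P(disease | positive) = {p_disease_given_positive:.2f}")  # only ~0.16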

2. Errors in Sampling

A sampling method is called biased if it systematically favors some outcomes over others. Sampling bias is sometimes called ascertainment bias (especially in biological fields) or systematic bias.

Example:
Telephone sampling is common in marketing surveys. A simple random sample may be chosen from a sampling frame consisting of a list of telephone numbers of people in the area being surveyed. This method does involve taking a simple random sample of that frame, but it is not a simple random sample of the target population.
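A small simulation (all numbers hypothetical) makes the point: drawing a simple random sample from a phone list whose coverage depends on income gives a biased estimate of the population mean, even though the draw itself is perfectly random.

# A minimal sketch of sampling-frame bias on hypothetical data: if the people
# reachable by phone differ systematically from the target population, a simple
# random sample *from the phone list* still yields a biased estimate.
import numpy as np

rng = np.random.default_rng(11)
pop_size = 100_000

# Hypothetical target population: income, with higher-income people far more
# likely to appear in the phone frame (0.8 vs 0.2 probability of being listed).
income = rng.lognormal(mean=10.5, sigma=0.5, size=pop_size)
listed = rng.random(pop_size) < np.where(income > np.median(income), 0.8, 0.2)

frame = income[listed]
srs_from_frame = rng.choice(frame, size=1_000, replace=False)
srs_from_population = rng.choice(income, size=1_000, replace=False)

print(f"true mean income:          {income.mean():,.0f}")
print(f"SRS from full population:  {srs_from_population.mean():,.0f}")  # close to the truth
print(f"SRS from phone frame only: {srs_from_frame.mean():,.0f}")       # biased upward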

3. Mistakes in Thinking About Causation

Consider elementary school students' shoe sizes and scores on a standard reading exam. They are correlated, but saying that larger shoe size causes higher reading scores is as absurd as saying that high reading scores cause larger shoe size.
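This is the classic lurking-variable situation: age drives both shoe size and reading ability. The sketch below simulates it with hypothetical numbers and shows the correlation largely disappearing once age is held roughly constant.

# A minimal sketch of a lurking variable (hypothetical numbers): age drives
# both shoe size and reading score, so the two are correlated even though
# neither causes the other. Conditioning on age makes the link fade.
import numpy as np

rng = np.random.default_rng(5)
n = 2_000

age = rng.uniform(6, 12, n)                       # years
shoe_size = 0.8 * age + rng.normal(0, 0.7, n)     # grows with age
reading = 10.0 * age + rng.normal(0, 8.0, n)      # improves with age

print(f"corr(shoe size, reading):        {np.corrcoef(shoe_size, reading)[0, 1]:.2f}")

# Within a single age band, the spurious correlation largely vanishes.
band = (age > 8.9) & (age < 9.1)
print(f"corr within the 9-year-old band: {np.corrcoef(shoe_size[band], reading[band])[0, 1]:.2f}")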

Suggestions for Researchers

1. for planning research:

  • Decide what questions you will be studying.
    • Trying to study too many things at once is likely to create problems with multiple testing, so it may be wise to limit your study.
  • If you will be gathering data, think about how you will gather and analyze it before you start to gather the data.

2. for analyzing data:

  • Before doing any formal analysis, ask whether or not the model assumptions of the procedure are plausible in the context of the data.
  • Plot the data (or residuals, as appropriate) whenever possible, to get additional checks on whether or not the model assumptions hold (see the sketch after this list).
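A short sketch of that advice on toy data: after fitting a straight line, plotting residuals against fitted values makes a violated assumption (here, variance growing with x) visible at a glance.

# A minimal sketch of a residual check on toy data: plotting residuals
# against fitted values helps reveal assumption violations (non-linearity,
# unequal variance) before trusting the results.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = np.linspace(1, 10, 120)
y = 3 + 2 * x + rng.normal(0, x * 0.5, x.size)   # noise grows with x

coef = np.polyfit(x, y, deg=1)
fitted = np.polyval(coef, x)
residuals = y - fitted

plt.scatter(fitted, residuals, s=12)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Funnel shape suggests non-constant variance")
plt.show()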

3. for writing up research:

  • Aim for transparency and reproducibility.
  • When citing sources, give explicit page numbers, especially for books.
  • Include discussion of why the analyses used are appropriate.
