In: Statistics and Probability
Business Statistics
Common Mistakes in Statistical Studies:
Top 6 most common statistical errors made by data scientists
Data scientists are a rare breed of professionals who can tackle the world's thorniest problems. These data-savvy professionals are seen as a rare combination of statistical and computational ingenuity, yet they are also prone to mistakes. Having covered the makings of a data scientist extensively, it is time to turn to the six most common statistical mistakes data scientists make. Many of these errors stem from the types of measurements used, the variability of the data, and the sample size. Statistics provides answers, but in some cases it confuses too.
Correlation is not causation
According to data science veteran Tom Fawcett, co-author of Data Science for Business, an underlying principle in statistics and data science is that correlation is not causation: just because two things appear to be related to each other does not mean that one causes the other. This is perhaps the most common mistake with time series. Fawcett cites the example of a stock market index and the unrelated time series "number of times Jennifer Lawrence was mentioned in the media." The lines look amusingly similar, and there is usually a statement like "Correlation = 0.86." Recall that a correlation coefficient ranges from +1 (a perfect linear relationship) to -1 (a perfect inverse relationship), with zero meaning no linear relationship at all; 0.86 is a high value, suggesting that the statistical relationship between the two time series is strong. Fawcett goes on to add that when exploring relationships between two time series, what one really wants to know is whether the variations in one series are correlated with the variations in the other.
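As a quick illustration of how easily unrelated series can appear correlated, the sketch below simulates two independent random walks (invented stand-ins for the stock index and the media-mention series, not Fawcett's actual data) and compares the correlation of the raw levels with the correlation of their period-to-period changes.

import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two independent random walks: by construction, neither causes the other.
stock_index = np.cumsum(rng.normal(size=n))      # hypothetical "stock index"
media_mentions = np.cumsum(rng.normal(size=n))   # hypothetical "media mentions"

# Correlation of the raw levels is often large in magnitude purely by chance
# (try a few different seeds), because both series drift over time.
level_corr = np.corrcoef(stock_index, media_mentions)[0, 1]

# Correlating the period-to-period changes removes the shared drift and
# usually reveals that there is no real relationship between the variations.
diff_corr = np.corrcoef(np.diff(stock_index), np.diff(media_mentions))[0, 1]

print(f"correlation of levels:      {level_corr:+.2f}")
print(f"correlation of differences: {diff_corr:+.2f}")

Differencing the series before correlating them is one common way to check whether the co-movement reflects a real relationship or just a shared trend.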
Biased Data
We have all heard of biased algorithms, but there is biased data as well. We are talking about biased sampling, which can lead to measurement errors because the sample is unrepresentative. In many cases, data scientists arrive at results that are close but not accurate because of biased estimators. An estimator is a rule for calculating an estimate of a given quantity from the observed data. Non-random samples, in particular, are likely to be biased, and their data cannot be used to represent any population beyond themselves.
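A minimal sketch of a biased estimator, using the classic example of estimating variance with a divisor of n instead of n - 1; the data are simulated normal samples invented for illustration, not anything from the article.

import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0            # population variance (standard deviation = 2)
n, trials = 10, 100_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
biased = samples.var(axis=1, ddof=0)      # divides by n
unbiased = samples.var(axis=1, ddof=1)    # divides by n - 1 (Bessel's correction)

print(f"true variance:              {true_var:.2f}")
print(f"mean of biased estimates:   {biased.mean():.2f}")    # systematically below 4
print(f"mean of unbiased estimates: {unbiased.mean():.2f}")  # close to 4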
Regression Error
In basic linear or logistic regression, mistakes arise from not knowing what should be tested on the regression table. In regression analysis, one identifies a dependent variable that varies based on the value of the independent variable. The first step is to specify the model by defining the response and predictor variables, and this is where most data scientists trip up by misspecifying the model. To avoid model misspecification, one must determine the functional relationship between the variables under consideration, as sketched below.
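To make the specification check concrete, this sketch (with made-up data) fits a straight line to data whose true relationship is quadratic and then inspects the residuals; a systematic pattern in the residuals is a standard warning sign of a misspecified functional form.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2.0 + 0.5 * x ** 2 + rng.normal(scale=2.0, size=x.size)   # true relationship is quadratic

# Misspecified model: straight line y = b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)
resid_linear = y - (b0 + b1 * x)

# Better-specified model: include the x**2 term
c2, c1, c0 = np.polyfit(x, y, deg=2)
resid_quad = y - (c0 + c1 * x + c2 * x ** 2)

# A systematic pattern in the residuals (here measured as correlation with
# x**2) warns that the assumed functional form is wrong.
print("linear model residuals vs x^2:   ", round(np.corrcoef(resid_linear, x ** 2)[0, 1], 2))
print("quadratic model residuals vs x^2:", round(np.corrcoef(resid_quad, x ** 2)[0, 1], 2))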
Misunderstanding P Value
Long pegged as the 'gold standard' of statistical validity, P values are a nebulous concept, and many scientists believe they are not as reliable as researchers often assume. P values are used to determine statistical significance in a hypothesis test. According to the American Statistical Association, a P value does not measure the probability that the studied hypothesis is true, nor the probability that the data were produced by random chance alone. Hence, business and organizational decisions should not be based only on whether a p-value passes a specific threshold. Data manipulation and significance chasing can make it impossible to draw the right conclusions from findings.
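One way to see why a bare p < 0.05 threshold can mislead: if many hypothesis tests are run on data with no real effect at all, roughly 5% of them will still clear the threshold by chance. The simulation below assumes scipy is available and uses purely simulated, effect-free groups.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_tests, n_per_group = 1000, 30

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=n_per_group)   # both groups come from the same
    b = rng.normal(size=n_per_group)   # distribution, so there is no real effect
    _, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests were 'significant' at p < 0.05 "
      f"({false_positives / n_tests:.1%}) despite no true effect")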
Inadequate Handling of Outliers and Influential Data Points
Outliers can affect any statistical analysis, so outliers should be investigated and then deleted, corrected, or explained as appropriate. For auditable work, the decision on how any outlier was treated should be documented. Sometimes a loss of information is a valid trade-off in return for enhanced comprehension.
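A small sketch of documenting outlier handling, using the common 1.5 x IQR rule on invented measurements; the rule is one convention among several, not a prescription from the article.

import numpy as np

# Invented measurements; 25.0 is an obvious outlier.
data = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 25.0, 12.3, 11.7, 12.1])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outlier_mask = (data < lower) | (data > upper)
print("flagged outliers:", data[outlier_mask])
print("remaining values:", data[~outlier_mask])
# Whatever is done next (delete, correct, or keep and explain the point)
# should be recorded so the analysis stays auditable.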
Loss of information
The main objective of statistical data analysis is to provide the best business outcome with minimal modeling or human bias. Sometimes, a loss of information in individual data points can affect the result and its relationship with the data set.
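A tiny illustration of information loss: the two invented rating samples below share the same mean, so reporting only the mean hides the fact that they describe very different situations.

import numpy as np

ratings_a = np.array([3, 3, 3, 3, 3])   # uniformly lukewarm reviews
ratings_b = np.array([1, 1, 3, 5, 5])   # strongly split reviews

print("mean of A:", ratings_a.mean(), "  mean of B:", ratings_b.mean())   # both 3.0
print("std of A: ", ratings_a.std(), "  std of B: ", ratings_b.std())     # 0.0 vs about 1.79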
1.) Choose 3 common mistakes, summarize them, and explain how the material from this course can help you avoid making these mistakes or falling for them when others make them.
Common Mistakes
1. Misunderstandings about Probability
There are many types of misunderstandings about probability; only a few of the more common or problematic ones can be covered here. One familiar example, the gambler's fallacy, is sketched below.
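As an illustration of the gambler's fallacy (the belief that after a run of heads, tails is "due"), the simulation below flips a fair coin many times and checks the outcome immediately after three heads in a row; the probability of heads stays near one half.

import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=100_000)   # 1 = heads, 0 = tails, fair coin

# Collect the outcome of every flip that immediately follows three heads in a row.
after_three_heads = [
    flips[i + 3]
    for i in range(len(flips) - 3)
    if flips[i] == flips[i + 1] == flips[i + 2] == 1
]

# Despite the streak, the next flip is still roughly 50/50.
print("P(heads after three heads in a row) ~", round(float(np.mean(after_three_heads)), 3))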
2. Errors in Sampling
A sampling method is called biased if it systematically favors some outcomes over others. Sampling bias is sometimes called ascertainment bias (especially in biological fields) or systematic bias.
Example:
Telephone sampling is common in marketing surveys. A simple random sample may be chosen from a sampling frame consisting of a list of telephone numbers of people in the area being surveyed. This method does involve taking a simple random sample, but it is not a simple random sample of the target population, because people without a listed telephone number are never included in the frame.
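The bias in that example can be made concrete with a small simulation. The sketch below invents a population in which households without a listed phone number tend to have lower incomes, so a simple random sample drawn from the phone frame overestimates the population mean; all figures are fabricated for illustration.

import numpy as np

rng = np.random.default_rng(5)
pop_size = 100_000

# Assume 80% of households have a listed phone number, and that unlisted
# households tend to have lower incomes (all figures are invented).
has_phone = rng.random(pop_size) < 0.8
income = np.where(has_phone,
                  rng.normal(60_000, 15_000, pop_size),   # listed households
                  rng.normal(35_000, 10_000, pop_size))   # unlisted households

# A simple random sample from the phone frame is still not a simple random
# sample of the whole population.
phone_frame = income[has_phone]
sample_from_frame = rng.choice(phone_frame, size=1_000, replace=False)

print(f"true population mean income:  {income.mean():,.0f}")
print(f"mean from phone-frame sample: {sample_from_frame.mean():,.0f}")   # biased upward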
3. Mistakes in Thinking About Causation
Consider elementary school students' shoe sizes and scores on a standard reading exam. The two are correlated, but saying that larger shoe size causes higher reading scores is as absurd as saying that high reading scores cause larger shoe sizes; both are driven by a lurking variable, the students' age, as the simulation below illustrates.
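A short simulation of the same idea, with invented numbers: age (the lurking variable) drives both shoe size and reading score, producing a strong overall correlation that largely disappears within a single age group.

import numpy as np

rng = np.random.default_rng(11)
n = 2_000

age = rng.integers(6, 12, size=n)                       # ages 6 through 11
shoe_size = 0.8 * age + rng.normal(scale=0.7, size=n)   # grows with age
reading = 10.0 * age + rng.normal(scale=8.0, size=n)    # improves with age

print("overall corr(shoe size, reading):",
      round(np.corrcoef(shoe_size, reading)[0, 1], 2))   # clearly positive

# Within a single age group, the association largely disappears.
mask = age == 8
print("corr among 8-year-olds only:     ",
      round(np.corrcoef(shoe_size[mask], reading[mask])[0, 1], 2))   # near zero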
Suggestions for Researchers
1. For planning research:
2. For analyzing data:
3. For writing up research: