Question

In: Computer Science

2. Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. With this question, we discuss the limitations of such approaches.

(a) For a set of 1,000,000 values, how many outliers would we have according to the test that says a value is an outlier if it is more than three standard deviations from the average? (Assume a normal distribution.)

(b) Does the approach that states an outlier is an object of unusually low probability need to be adjusted when dealing with large data sets? If so, how?

Solutions

Expert Solution

(a) Take a standard normal distribution N(0, 1) for simplicity, and let α = P(|x| ≥ c) be the probability that a value lies at least c standard deviations from the mean. For c = 3, α ≈ 0.0027, so each value has a probability of about 0.0027 of falling outside the symmetric interval (−c, c). For 1,000,000 values, the expected number of outliers is therefore about 1,000,000 × 0.0027 = 2,700.
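The arithmetic in part (a) can be checked with a short Python sketch. It assumes the data really are standard normal, and the function names `two_sided_tail` and `expected_outliers` are illustrative only. The tail probability comes from the identity P(|Z| ≥ c) = erfc(c / √2).

```python
import math


def two_sided_tail(c: float) -> float:
    """Return P(|Z| >= c) for Z ~ N(0, 1), using the complementary error function."""
    return math.erfc(c / math.sqrt(2))


def expected_outliers(n: int, c: float) -> float:
    """Expected number of values, out of n normal draws, at least c standard deviations from the mean."""
    return n * two_sided_tail(c)


if __name__ == "__main__":
    # Print the tail probability and expected count for a few thresholds.
    for c in (3, 4, 5):
        print(c, two_sided_tail(c), expected_outliers(1_000_000, c))
```

For c = 3 and n = 1,000,000 this gives roughly 2,700, matching the figure above.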

(b) Yes. With a fixed probability threshold, the number of values flagged as outliers grows in proportion to the size of the data set, so the threshold should be made stricter. Increasing the threshold to 4 standard deviations gives α ≈ 6.3 × 10^-5, corresponding to only about 63 outliers in 1,000,000 values. In experimental particle physics, one typically sets a threshold of 5 standard deviations, with α ≈ 5.7 × 10^-7, which should be halved when testing for the significance of an event/outlier (one-sided).

When dealing with a large data set, the analysis can also take a different approach to classifying outliers. An outlier can be defined by its distance from the center of the data set in comparison with the other objects in the data set. Depending on the precision of the data set, an outlier is then an object that lies farther from the center of the data set than a defined radial distance.
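One possible way to make the adjustment in part (b) concrete is to choose the threshold c so that the expected number of normal values flagged by chance stays fixed as the data set grows. The sketch below is an illustration under that assumption, not a method stated in the solution; the function name `threshold_for` and the default target of one expected outlier are hypothetical choices.

```python
from statistics import NormalDist


def threshold_for(n: int, expected: float = 1.0) -> float:
    """Return c such that n draws from N(0, 1) yield about `expected` values with |x| >= c.

    Assumes normally distributed data; `expected` is an illustrative target.
    """
    alpha = expected / n  # per-value two-sided tail probability
    # Split alpha across both tails and invert the standard normal CDF.
    return NormalDist().inv_cdf(1 - alpha / 2)


if __name__ == "__main__":
    # Show how the threshold grows with the data set size.
    for n in (300, 10_000, 1_000_000):
        print(n, round(threshold_for(n), 2))
```

Under these assumptions, a data set of a few hundred values gives a threshold close to 3 standard deviations, while 1,000,000 values require a threshold between 4 and 5 standard deviations to keep the same expected count.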
