Question

In: Statistics and Probability

OK I have two data sets with 30 million rows each each data set is five...

OK, I have two data sets with 30 million rows each. Each data set has five columns: four attributes and an amount. I want to confirm that the two data sets are exactly the same. No two rows of data in the 30 million rows are duplicates.

For my proof I will confirm each data set has the same number of rows. And I will also do the following:

I will create four smaller data sets from each of the two large data sets. Each of the smaller data sets will remove one of the four attributes.

If each of those four data sets matches exactly and the total count matches, is that proof that the two large data sets are exactly the same?

The reason I am doing this test is that I am not able to compare 30 million rows to 30 million rows, because the set is too large for the tools I have available.
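One possible way around the size limit, offered only as a sketch and not as something the tools in question are known to support, is to split both data sets into buckets with the same deterministic hash of the full row and then compare the buckets pair by pair. Identical rows always land in the same bucket number, so if every bucket pair matches exactly, the full data sets match exactly. The function names (`bucket_of`, `partition`, `same_rows`) and the bucket count are hypothetical, and the sketch keeps the buckets in memory where a real run would presumably write each bucket to its own file.

```python
import hashlib
from collections import Counter, defaultdict


def bucket_of(row, n_buckets):
    """Map a row to a bucket number using a hash that is stable across runs."""
    key = "\x1f".join(map(str, row)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def partition(rows, n_buckets):
    """Group rows by bucket number, counting how often each row occurs."""
    buckets = defaultdict(Counter)
    for row in rows:
        buckets[bucket_of(row, n_buckets)][row] += 1
    return buckets


def same_rows(rows_a, rows_b, n_buckets=16):
    """Return True only if both inputs hold the same rows, ignoring order."""
    parts_a = partition(rows_a, n_buckets)
    parts_b = partition(rows_b, n_buckets)
    # The hash only decides where a row goes; each bucket pair is then
    # compared exactly, so a hash collision cannot hide a difference.
    return all(parts_a[i] == parts_b[i] for i in range(n_buckets))
```

Each bucket holds roughly 1/`n_buckets` of the rows, so the bucket count can be raised until a single bucket pair fits within the available tool.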

QUESTION: If the four smaller data sets match exactly and the total row count matches exactly, have I proved that the two 30 million row data sets are exactly the same?
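The proposed test can be tried on a toy example. The sketch below uses invented data (four binary attributes and a constant amount, not anything from the real data sets): data set A holds the rows whose attributes sum to an even number, and data set B holds the rows whose attributes sum to an odd number. Both have the same row count and no duplicate rows, and all four "drop one attribute" versions match exactly, yet A and B have no row in common. This suggests that matching the four smaller data sets plus the row count is not, on its own, proof that the large data sets are the same.

```python
from itertools import product

# Hypothetical toy data: four binary attributes plus a constant amount.
AMOUNT = 100
cells = list(product([0, 1], repeat=4))
set_a = [(*c, AMOUNT) for c in cells if sum(c) % 2 == 0]  # even parity rows
set_b = [(*c, AMOUNT) for c in cells if sum(c) % 2 == 1]  # odd parity rows


def drop_attribute(rows, i):
    """Return the rows, sorted, with attribute column i (0 to 3) removed."""
    return sorted(row[:i] + row[i + 1:] for row in rows)


counts_match = len(set_a) == len(set_b)
no_duplicates = (len(set(set_a)) == len(set_a)
                 and len(set(set_b)) == len(set_b))
projections_match = all(
    drop_attribute(set_a, i) == drop_attribute(set_b, i) for i in range(4)
)
data_sets_match = sorted(set_a) == sorted(set_b)

# prints: True True True False
print(counts_match, no_duplicates, projections_match, data_sets_match)
```

The reason is that each smaller data set only constrains three attributes at a time. Once the fourth attribute is removed, an even-parity row and its odd-parity counterpart look identical.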

Solutions

Expert Solution

  


Related Solutions

Given two sets of data, A and B. i) Data set A has an r value...
Given two sets of data, A and B. i) Data set A has an r value of -.81 and data set B has an r value of .94 Describe the differences between the two data sets as completely as you can using the regression information we have learned. ii) Which linear regression equation, the one for A or the one for B, would probably be a better predictor? Why?
How can I reduce data set by deleting any rows that have all FALSE bool values...
How can I reduce data set by deleting any rows that have all FALSE bool values for every column in that row using pandas. Assuming there are 20+ columns/rows to loop through. Example: The table data below the pandas code should drop/reduce the data to remove the second & fifth row. True and False in the table are dtype bool. id Test1 value1 value2 value3 value4 0.1 1 False False False False 0.2 2 False True True False 0.3 3...
Suppose you have two sets of data to work with. The first set is a list of...
Suppose you have two sets of data to work with. The first set is a list of all the injuries that were seen in a clinic in a month's time. The second set contains data on the number of minutes that each patient spent in the waiting room of a doctor's office. Propose your idea of how to represent the key information. To organize your data, would you choose to use a frequency table, a cumulative frequency table, or a relative frequency table? Why?
Healthcare data sets is an interesting topic. What are data sets? Why would a data set...
Healthcare data sets is an interesting topic. What are data sets? Why would a data set be developed? Provide one to two examples only not a list.
Where would I find five sets of data that produces a correlation of .56 between the...
Where would I find five sets of data that produces a correlation of .56 between the variables? Design a correlational study that will need two variables with at least five sets of data. between these two variables: time spent playing video games and aggression. Then in 500-750 words, do the following: Assume the study produces a correlation of .56 between the variables. Analyze three possible causal reasons for the relationship. Submit an SPSS output for the correlational study.
Suppose you have one data set with 30 cases, each case representing a student in this...
Suppose you have one data set with 30 cases, each case representing a student in this class. The following variables are available: age, gender/sex, race/ethnicity, class (freshman, sophomore, etc.), and GPA. For each of the 5 variables, explain (1) the level of measurement and (2) the measures of central tendency available to them. Race: Gender/sex: Race/ethnicity: Class: GPA:
Open the files for the Course Project and the data set. For each of the five...
Open the files for the Course Project and the data set. For each of the five variables, process, organize, present, and summarize the data. Analyze each variable by itself using graphical and numerical techniques of summarization. Use Excel as much as possible, explaining what the results reveal. Some of the following graphs may be helpful: stem-leaf diagram, frequency/relative frequency table, histogram, boxplot, dotplot, pie chart, and bar graph. Caution: not all of these are appropriate for each of these variables,...
I have Standard Deviation and Mean of 2 sets of data. Based on the data, how...
I have Standard Deviation and Mean of 2 sets of data. Based on the data, how can we infer at the 5% significance level that the score of individuals in the 4th year is better than the individuals in 1st year? average 71.29 76.98 S.D. 8.58 8.119 Year 1 Year 4 The sample size is 430
Design a correlational study, you will need two variables with at least five sets of data....
Design a correlational study, you will need two variables with at least five sets of data. between these two variables: time spent playing video games and aggression. My question: Assume the study produces a correlation of .56 between the variables. Analyze three possible causal reasons for the relationship.
A simple Statistic question by using R, If I have two set of mean proportion data,...
A simple Statistic question by using R, If I have two set of mean proportion data, what test should I use? such as, [1] 0.7652632 0.7555354 0.7602588 0.7594096 0.7497992 0.5532588 0.7595661 0.6911504 [9] 0.5964602 0.6369565 0.7355828 0.7346225 0.5913793 0.6499079 0.6327273 0.6091873 [17] 0.6306122 0.5960784 0.5492918 0.6785714 0.5014787 0.5484848 0.5645403 0.6731343 [25] 0.6208191 0.6087248 0.6045045 0.7743390 0.5275862 0.5731278 [1] 0.6564195 0.5928482 0.6806709 0.5546422 0.5438393 0.5906535 0.6764637 0.6487188 [9] 0.5901547 0.6626735 0.5955325 0.7462415 0.5971111 0.5731504 0.6334729 0.6124653 [17] 0.6224686 0.5549067 0.6348427 0.6265627...