Question

In: Advanced Math

OK I have two data sets with 30 million rows each each data set is five...

OK I have two data sets with 30 million rows each each data set is five columns with four attributes and an amount. I want to confirm that the two data sets are exactly the same no two rows of data in the 30 million rolls are duplicates

For my proof I will confirm each data set has the same number of rows. And I will also do the following:

I will create four smaller data sets from each of the two large data sets. Each of the smaller data sets will remove one of the four attributes

If each of those for data sets matches exactly and the total count matches is that proof that the two large data sets are exactly the same

The reason I am doing this test as I am not able to compare 30 million rows to 30 million rows because the set is too large for the tools I have available

QUESTION: If the four smaller data sets match exactly and the total row count matches exactly. Have I proved these the two 30 million row data sets are exactly the same l?

Solutions

Expert Solution

Indeed it proves so. But you have to be a bit careful. There might be permutations of attribute columns in the data sets. What I mean is, let your data sets be and with columns and (I am assuming all the columns have the same number of data points).Then there might exist a permutation such that . Then your data sets are the same up to a permutation and your algorithm may or may not face problems, but here is a safer alternative (also your method is a bit redundant, eating up more space and time).

Partition your data sets into smaller data sets, each containing just 1 column. Then check to see if a column in the partition of is also in the partition of . If yes, remove that column from the partition of and and repeat with another column. If no, then the data sets are not the same. If the process exhausts all the columns of and then you are done. If not, then they are not the same.

This is safer because permutations don't matter here since every column is checked with every column (this also has a lower time and obviously lower space complexity).


Related Solutions

OK I have two data sets with 30 million rows each each data set is five...
OK I have two data sets with 30 million rows each each data set is five columns with four attributes and an amount. I want to confirm that the two data sets are exactly the same no two rows of data in the 30 million rolls are duplicates For my proof I will confirm each data set has the same number of rows. And I will also do the following: I will create four smaller data sets from each of...
Given two sets of data, A and B. i) Data set A has an r value...
Given two sets of data, A and B. i) Data set A has an r value of -.81 and data set B has an r value of .94 Describe the differences between the two data sets as completely as you can using the regression information we have learned. ii) Which linear regression equation, the one for A or the one for B, would probably be a better predictor? Why?
How can I reduce data set by deleting any rows that have all FALSE bool values...
How can I reduce data set by deleting any rows that have all FALSE bool values for every column in that row using pandas. Assuming there are 20+ columns/rows to loop through. Example: The table data below the pandas code should drop/reduce the data to remove the second & fifth row. True and False in the table are dtype bool. id Test1 value1 value2 value3 value4 0.1 1 False False False False 0.2 2 False True True False 0.3 3...
suppose you have two sets of data to work with.The first set is a list of...
suppose you have two sets of data to work with.The first set is a list of all the injuries that were seen in a clinic in a month's time.The second set contains data on the number of minutes that each patient spent in the waiting room of a doctor's office. Propose your idea of how to represent the key information.To organize your data would you choose to use a frequency table,a culmative frequency table, or avrelative frequency table?Why?
Healthcare data sets is an interesting topic. What are data sets? Why would a data set...
Healthcare data sets is an interesting topic. What are data sets? Why would a data set be developed? Provide one to two examples only not a list.
Where would I find five sets of data that produces a correlation of .56 between the...
Where would I find five sets of data that produces a correlation of .56 between the variables? Design a correlational study that will need two variables with at least five sets of data. between these two variables: time spent playing video games and aggression. Then in 500-750 words, do the following: Assume the study produces a correlation of .56 between the variables. Analyze three possible causal reasons for the relationship. Submit an SPSS output for the correlational study.
Suppose you have one data set with 30 cases, each case representing a student in this...
Suppose you have one data set with 30 cases, each case representing a student in this class. The following variables are available: age, gender/sex, race/ethnicity, class (freshman, sophomore, etc.), and GPA. For each of the 5 variables, explain (1) the level of measurement and (2) the measures of central tendency available to them. Race: Gender/sex: Race/ethnicity: Class: GPA:
Open the files for the Course Project and the data set. For each of the five...
Open the files for the Course Project and the data set. For each of the five variables, process, organize, present, and summarize the data. Analyze each variable by itself using graphical and numerical techniques of summarization. Use Excel as much as possible, explaining what the results reveal. Some of the following graphs may be helpful: stem-leaf diagram, frequency/relative frequency table, histogram, boxplot, dotplot, pie chart, and bar graph. Caution: not all of these are appropriate for each of these variables,...
I have Standard Deviation and Mean of 2 sets of data. Based on the data, how...
I have Standard Deviation and Mean of 2 sets of data. Based on the data, how can we infer at the 5% significance level that the score of individuals in the 4th year is better than the individuals in 1st year? average 71.29 76.98 S.D. 8.58 8.119 Year 1 Year 4 The sample size is 430
Design a correlational study, you will need two variables with at least five sets of data....
Design a correlational study, you will need two variables with at least five sets of data. between these two variables: time spent playing video games and aggression. My question: Assume the study produces a correlation of .56 between the variables. Analyze three possible causal reasons for the relationship.
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT