OK, I have two data sets with 30 million rows each. Each data set is five columns, with four attributes and an amount. I want to confirm that the two data sets are exactly the same. No two rows of data in the 30 million rows are duplicates.
For my proof I will confirm each data set has the same number of rows. And I will also do the following:
I will create four smaller data sets from each of the two large data sets. Each of the smaller data sets will remove one of the four attributes.
If each of those four data sets matches exactly and the total count matches, is that proof that the two large data sets are exactly the same?
The reason I am doing this test is that I am not able to compare 30 million rows to 30 million rows, because the set is too large for the tools I have available.
QUESTION: If the four smaller data sets match exactly and the total row count matches exactly, have I proved that the two 30 million row data sets are exactly the same?
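Indeed it proves so, provided "match exactly" means that the rows are in the same order in both data sets and each smaller data set is compared row by row: every column then appears in at least one of the comparisons, so every entry is checked against its counterpart. If the smaller data sets are compared only as unordered collections of rows, matching is necessary but not sufficient. For example, with 0/1 attributes and a constant amount, the eight rows with an even number of 1s and the eight rows with an odd number of 1s have identical projections but share no row. You also have to be a bit careful about something else. There might be permutations of attribute columns in the data sets. What I mean is, let your data sets be $A$ and $B$ with columns $a_1, \dots, a_5$ and $b_1, \dots, b_5$ (I am assuming all the columns have the same number of data points). Then there might exist a permutation $\sigma$ of $\{1, \dots, 5\}$ such that $a_i = b_{\sigma(i)}$ for every $i$. Then your data sets are the same up to a permutation of columns, and your algorithm may or may not face problems. Here is a safer alternative (your method is also a bit redundant, eating up more space and time).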
Partition each of the two data sets $A$ and $B$ into smaller data sets, each containing just one column (with the rows kept in their original order). Then check whether a column in the partition of $A$ is also in the partition of $B$. If yes, remove that column from the partitions of both $A$ and $B$ and repeat with another column. If no, then the data sets are not the same. If the process exhausts all the columns of $A$ and $B$, then you are done. If not, then they are not the same.
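The column-by-column check above can be sketched as follows. This is only an illustration under stated assumptions, not a definitive implementation: it assumes each data set is a CSV file with the rows in the same order, it compares values as text (so `10.5` and `10.50` count as different), and it replaces direct column comparison with a SHA-256 digest per column so that a 30 million row column never has to be held in memory. The function names `column_digests` and `same_up_to_column_order` are hypothetical.

```python
import csv
import hashlib
import io
from collections import Counter


def column_digests(csv_file, n_columns=5):
    """Stream a CSV once and return one SHA-256 digest per column.

    Rows are read in order, so two columns get the same digest only if
    they hold the same values in the same row positions (up to the
    negligible chance of a hash collision).
    """
    hashers = [hashlib.sha256() for _ in range(n_columns)]
    for row in csv.reader(csv_file):
        if len(row) != n_columns:
            raise ValueError(f"expected {n_columns} fields, got {len(row)}")
        for hasher, value in zip(hashers, row):
            # A length prefix keeps the values "ab", "c" distinct from "a", "bc".
            data = value.encode("utf-8")
            hasher.update(len(data).to_bytes(8, "big") + data)
    return [h.hexdigest() for h in hashers]


def same_up_to_column_order(file_a, file_b, n_columns=5):
    """True if every column of A can be paired with an identical column of B."""
    # Comparing the two multisets of digests is the "match a column,
    # remove it from both sides, repeat" loop done in one step.
    return Counter(column_digests(file_a, n_columns)) == Counter(
        column_digests(file_b, n_columns)
    )


# Toy data standing in for the real files (assumption: comma-separated text).
a = io.StringIO("x,1,red,N,10.50\ny,2,blue,S,7.25\n")
b = io.StringIO("1,x,red,N,10.50\n2,y,blue,S,7.25\n")  # first two columns swapped
c = io.StringIO("x,1,red,N,10.50\ny,2,blue,S,7.26\n")  # one amount differs

print(same_up_to_column_order(a, b))  # True
a.seek(0)
print(same_up_to_column_order(a, c))  # False
```

Each file is read once and only five running digests are kept, so memory use does not grow with the number of rows. With real files, opening them with `open(path, newline="")` in place of the `io.StringIO` objects should work the same way.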
This is safer because permutations of columns don't matter here, since every column is checked against every column (this also has a lower time complexity and obviously a lower space complexity).
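Returning to the original four-projection test, whether matching projections prove equality depends on how they are compared. The sketch below uses toy data to show the weaker case, where each projection is compared as an unordered collection of rows and no common row order is assumed. The 0/1 attributes, the constant amount, and the helper name `projections` are all made up for illustration.

```python
from itertools import product


def projections(rows):
    """Return the four projections of `rows`, each dropping one attribute.

    Each projection is returned as a sorted list, so it is compared as an
    unordered collection of rows (duplicates are kept).
    """
    return [
        sorted(row[:i] + row[i + 1:] for row in rows)
        for i in range(4)  # positions 0..3 are attributes, position 4 is the amount
    ]


amount = 100  # the same amount on every row, to keep the example small
bits = list(product((0, 1), repeat=4))
even = [b + (amount,) for b in bits if sum(b) % 2 == 0]  # even number of 1s
odd = [b + (amount,) for b in bits if sum(b) % 2 == 1]   # odd number of 1s

same_row_count = len(even) == len(odd)
same_projections = projections(even) == projections(odd)
shared_rows = set(even) & set(odd)

print(same_row_count, same_projections, len(shared_rows))  # True True 0
```

Both toy data sets have eight distinct rows and identical projections, yet they have no row in common. So the row count plus matching projections is conclusive only when the rows are in the same order and are compared position by position.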