In: Computer Science
A sequential file contains at most four billion 32-bit integers in random order, we need to find a 32-bit integer that isn’t in the file. Note that certainly we can simply read them all into memory (if we have enough memory) and sort them. That would cost O(n log n) for sorting and O(n) for searching. Think if we do better than that WITHOUT sorting and answer the following questions with better solutions.
(a) Explain why there must be at least one such integer missing?
(b) If you have some memory that is enough to hold one bit for each such integer, i.e., with such bitmap approach, how would you solve the problem?
(c) If you only have a few hundred bytes of main memory but you could use several external files, how would you solve it? (hint: use binary search)
Answer a
32-bit integer means an integer value can be from 0 till 232-1.
32-bit Integer value range = 0 to 4,294,967,295 means more than 4 billion.
So if we have file having 4 billion integers in random order, obviously there will be many numbers missing in the file. To have all numbers supported by 32-bit integer file must have each and every value from 0 to 4,294,967,295 which is not the case.
So we can say for sure at least one number is missing.
Answer b
If we need to solve the missing numbers problem we can have bitmap where each bit indicating whether particular number is available in file or not.
so we need to have 232 bits = 4,294,967,296 bits which means 512 MB memory.
We can create bit map for 4,294,967,296 entries and initialize it to 0. We will scan through the file. For any number found,we mark corresponding bit as 1. If we find same number again we don't change anything. At the end we scan the bit map and list out the entries with value 0. Those all are missing numbers.
Answer c
If we have few hundreds bytes of memory. We can create bit map accordingly. And in multiple scan of files we can find out which numbers are missing.
For example 512 Bytes means we can have bit map for 4096 integers. In first scan if we find any number from 0 to 4095 we store the bit value. After the scan we scan the bit map to find integers missing based on 0 value of bit. In 2nd scan we can store again any number from 4096 - 8191. Same way we can scan full file in multiple scans.