In: Computer Science
Discuss the issues involved in sorting large data
files. Explain how indexing is used to overcome this
issue.
Explain the difference between physical order and
logical order
Why sorting: In database it is easy to get records if the records are already sorted. in sorted record binary search can be applied of order O(logn) while unsorted will use linear search of order o(n).
Sorting a large file: sorting is a process in which the records are sorted depending upon the key value in either accending or decending order. The best sorting algorithm in O(nlog) and worst is O(). When a large database is to be sorted multiple time with based on different attribute then using this process the latency will increase and it will not be enough to work with database.
example: let's have a table with 500 column(attributes) and 2 billion rows(tuples). To sort the table in any one of the given column minimum 2*log2 billion operations are required.
This problem can be reduced using b+ trees which are form of mulltiple indexing.
What is indexing: lets take an example to understand indexing, a book may contain 10,000 pages but to index those pages we only require 1 or 2 pages. i.e searching space is reduced to 1 or 2 from 10,000 this is what indexing does.
There are three types of indexing:
Let's take an example and understand cost of indexing and without indexing.
say you have 30,000 records, each of record is of 100 bytes, block size if 1024 bytes. how many blocks need to be search to get a record in indexed data and non-indexed data.
Without indexing:
one block will accomodate = block size/data size = 1024/100 = 10 some extra space left in block to be sparce indexing.
total block requird to accomodate 30,000 records = total blocks/records per block = 30,000/10 = 3,000 blocks.
if indexing not used then we have to search through all the block to get the desired data in linear time O(n). i.e 3000
With Indexing:
here we have to maintain an index file same as book does to store title and page no. This index file will contain key and refrence to the block containing key. so size of index record will be less as compared to table as it contain only two colums and table can have many colums.
NOTE: as block size is fixed for any hardware. so it will be fixed for indexing and non-indexing.
Now we can accomodate large no of index records in one block as compared on table record. assume one index recored size if 15 bytes.
one block will accomodate = block size/index entry size = 1024/15 = 68 entries of index file.
block required to index the 3,000 block of table = total block/entries per block = 3,000/68 = 44.11 so 45 index blocks are required to index 3,000 block of table.
Now to search a record in the table first find the block using index then find the record in refrenced block. As index is always sorted apply binary search O(logn) =log(45) = 6 i.e at max i have to search 6 entries to get the desired block when block is found then in constant time(1) find record in that block. so, 6+1 = 7
Hence problem is reduced from 3,000 to 7 from non-indexing to indexing. thats why indexing is used.
Physical and virtual order: physical order is the order in which data are stored at hardware level while virtual order is order in which we retrieve the data according to our requirement( i.e indexing).
Happy Learning :)
kindly rate!