Question

In: Computer Science

Discuss the issues involved in sorting large data files. Explain how indexing is used to overcome...

Discuss the issues involved in sorting large data files. Explain how indexing is used to overcome this issue.

Explain the difference between physical order and logical order

Expert Solution

Why sorting: In database it is easy to get records if the records are already sorted. in sorted record binary search can be applied of order O(logn) while unsorted will use linear search of order o(n).

Sorting a large file: sorting is a process in which the records are sorted depending upon the key value in either accending or decending order. The best sorting algorithm in O(nlog) and worst is O(). When a large database is to be sorted multiple time with based on different attribute then using this process the latency will increase and it will not be enough to work with database.

example: let's have a table with 500 column(attributes) and 2 billion rows(tuples). To sort the table in any one of the given column minimum 2*log2 billion operations are required.

This problem can be reduced using b+ trees which are form of mulltiple indexing.

What is indexing: lets take an example to understand indexing, a book may contain 10,000 pages but to index those pages we only require 1 or 2 pages. i.e searching space is reduced to 1 or 2 from 10,000 this is what indexing does.

There are three types of indexing:

Primary indexing: data is sorted based on primary key(sorted) of table.
clustured indexing: data is sorted based on group of non-key attributes(sorted).
secondary indexing: data is sorted based on any attribute which is unsorted.

Let's take an example and understand cost of indexing and without indexing.

say you have 30,000 records, each of record is of 100 bytes, block size if 1024 bytes. how many blocks need to be search to get a record in indexed data and non-indexed data.

Without indexing:

one block will accomodate = block size/data size = 1024/100 = 10 some extra space left in block to be sparce indexing.

total block requird to accomodate 30,000 records = total blocks/records per block = 30,000/10 = 3,000 blocks.

if indexing not used then we have to search through all the block to get the desired data in linear time O(n). i.e 3000

With Indexing:

here we have to maintain an index file same as book does to store title and page no. This index file will contain key and refrence to the block containing key. so size of index record will be less as compared to table as it contain only two colums and table can have many colums.

NOTE: as block size is fixed for any hardware. so it will be fixed for indexing and non-indexing.

Now we can accomodate large no of index records in one block as compared on table record. assume one index recored size if 15 bytes.

one block will accomodate = block size/index entry size = 1024/15 = 68 entries of index file.

block required to index the 3,000 block of table = total block/entries per block = 3,000/68 = 44.11 so 45 index blocks are required to index 3,000 block of table.

Now to search a record in the table first find the block using index then find the record in refrenced block. As index is always sorted apply binary search O(logn) =log(45) = 6 i.e at max i have to search 6 entries to get the desired block when block is found then in constant time(1) find record in that block. so, 6+1 = 7

Hence problem is reduced from 3,000 to 7 from non-indexing to indexing. thats why indexing is used.

Physical and virtual order: physical order is the order in which data are stored at hardware level while virtual order is order in which we retrieve the data according to our requirement( i.e indexing).

Happy Learning :)

kindly rate!

venereology answered 8 months ago

Explain how empowerment and participation are used in motivation and to overcome resistance to change.

Explain the process of creating flat (text) files and how to store/retrieve data.

Discuss three major ethical issues that are involved with organ transplantation.

Explain how comparative genomics can be used to find the genes involved in production of a...

Explain how comparative genomics can be used to find the genes involved in production of a particular trait (you can use any study species that interest you as a hypothetical example).

Explain how a letter of credit is used with an example for all 3 parties involved....

Explain how a letter of credit is used with an example for all 3 parties involved. Be Detailed in the process/ chain linking all parties

Which statement is NOT correct? The Virtual storage Access Method a. is used for very large files ..

Which statement is NOT correct? The Virtual storage Access Methoda. is used for very large files that need both direct access and batch processing.b. may use an overflow area for records.c. provides an exact physical address for each record.d. is appropriate for files that require few insertions or deletions.

Discuss the issues involved when recording and transalating foreign curency transactions.

Discuss the theoretical issues involved in the privatisation debate. Further. comment on this issue on the...

Discuss the theoretical issues involved in the privatisation debate. Further. comment on this issue on the basis of theoretical points and empirical evidence from India and abroad. ANSWER IN ABOUT 2000 WORDS.

Explain the issues involved with the Fed acting as a Lender of Last Resort (LLR). Explain...

Explain the issues involved with the Fed acting as a Lender of Last Resort (LLR). Explain the concept of Moral hazard and its impact

For each of the following biological issues explain: (a) possible ethical issues involved; (b) two sides...

For each of the following biological issues explain: (a) possible ethical issues involved; (b) two sides or views of the issue; (c) possible separation of scientific versus societal view and opinions. 1. Cloning 3. Rare and endangered species conservation 4. Sea level rise