Question

Evaluate any two data reduction techniques with examples. What is the format for reporting results of each?

Solutions

Expert Solution

There are three types of data reduction techniques: feature reduction, case reduction and value reduction.

1. Feature reduction reduces the number of features (columns) in the data set through selection of the most relevant features or combination of two or more features into a single feature.

Possible feature reduction techniques include principal components analysis, heuristic feature selection with a wrapper method, and feature selection with decision trees.

2. Case reduction reduces the number of cases (rows) in a data set, usually through specialized sampling methods or sampling strategies.

Examples of case reduction techniques are incremental samples, average samples, increasing the sampling period, and strategic sampling of key events.

3. Value reduction means reducing the number of different values a feature can take through grouping of values into a single category.

Prominent value reduction techniques are rounding, k-means clustering, and discretization using entropy minimization.
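As a rough illustration of case reduction and value reduction, the sketch below averages runs of readings (an "average samples" style of case reduction) and rounds values to a fixed step (value reduction by rounding). The function names, window size, step and sample data are all hypothetical, chosen only for this example.

```python
from statistics import mean


def average_samples(readings, window):
    """Case reduction: replace each run of `window` readings with its mean."""
    return [mean(readings[i:i + window]) for i in range(0, len(readings), window)]


def round_to_step(values, step):
    """Value reduction: snap each value to the nearest multiple of `step`."""
    return [step * round(v / step) for v in values]


# Hypothetical sensor readings: 8 rows become 2 rows.
readings = [20, 22, 21, 25, 30, 32, 31, 35]
reduced_rows = average_samples(readings, window=4)

# Hypothetical ages: 5 distinct values become 3 distinct values.
ages = [23, 37, 41, 58, 62]
reduced_values = round_to_step(ages, step=10)

print(len(readings), "rows ->", len(reduced_rows), "rows")
print(len(set(ages)), "distinct values ->", len(set(reduced_values)), "distinct values")
```

When reporting results for either technique, it is usual to state the size before and after reduction (rows, or distinct values) together with the parameter used (window or step).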

1. The best-known data reduction technique is data deduplication, which eliminates redundant data on storage systems. The deduplication process typically occurs at the storage block level. The system analyzes the storage to see if duplicate blocks exist, and gets rid of any redundant blocks. The remaining block is shared by any file that requires a copy of the block.

Data deduplication -- often called intelligent compression or single-instance storage -- is a process that eliminates redundant copies of data and reduces storage overhead. Data deduplication techniques ensure that only one unique instance of data is retained on storage media, such as disk, flash or tape.

On an IBM Tivoli Storage Manager (IBM Spectrum Protect) server, use the following commands to help you evaluate data deduplication effectiveness:

Action: Use the QUERY STGPOOL server command to quickly check deduplication results.
Explanation: The Duplicate Data Not Stored field shows the actual reduction of data, in megabytes or gigabytes, and the percentage of reduction of the storage pool. For example, issue the following command:

    query stgpool format=detailed

If the query is run before reclamation of the storage pool, the Duplicate Data Not Stored value is not accurate because it does not reflect the most recent data reduction. If reclamation has not yet taken place, issue the following command to show the amount of data to be removed:

    show deduppending backkuppool-file

where backkuppool-file is the name of the deduplicated storage pool.

Action: Use the QUERY OCCUPANCY server command.
Explanation: This command shows the logical amount of storage per file space when a file space is backed up to a deduplicated storage pool.

EXAMPLE

A typical email system might contain 100 instances of the same 1 megabyte (MB) file attachment. If the email platform is backed up or archived without deduplication, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is stored; each subsequent instance is referenced back to the one saved copy. In this example, a 100 MB storage demand drops to 1 MB.
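The email example can be sketched in a few lines of Python. This is a minimal in-memory illustration of block-level deduplication, not how any particular storage product implements it: the `DedupStore` class, the 4096-byte block size and the `make_attachment` helper are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, in bytes


class DedupStore:
    """Toy block-level deduplicating store (illustrative only)."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 digest -> block bytes, stored once
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Keep the block only if no identical block is stored yet.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_size(self):
        """Bytes the files would occupy without deduplication."""
        return sum(len(self.blocks[d]) for r in self.files.values() for d in r)

    def stored_size(self):
        """Bytes actually held after deduplication."""
        return sum(len(b) for b in self.blocks.values())


def make_attachment(size):
    """Build deterministic, non-repeating bytes to stand in for an attachment."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:size])


attachment = make_attachment(1024 * 1024)  # 1 MB
store = DedupStore()
for n in range(100):
    store.put(f"mail-{n}", attachment)

saved = store.logical_size() - store.stored_size()
print("logical MB:", store.logical_size() / 2**20)
print("stored MB:", store.stored_size() / 2**20)
print("reduction %:", 100 * saved / store.logical_size())
```

The last three values mirror what a deduplication report usually states: the logical size, the size actually stored, and the percentage of duplicate data not stored.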

2. Data archiving and data compression can also reduce the amount of data that has to be stored on primary storage systems.

Data compression reduces the size of a file by removing redundant information from files so that less disk space is required. This is accomplished natively in storage systems using algorithms or formulas designed to identify and remove redundant bits of data.

Data compression is a reduction in the number of bits needed to represent data. Compressing data can save storage capacity, speed up file transfer, and decrease costs for storage hardware and network bandwidth.

EXAMPLE

(A) Data compression can dramatically decrease the amount of storage a file takes up.

For example, in a 2:1 compression ratio, a 20 megabyte (MB) file takes up 10 MB of space. As a result of compression, administrators spend less money and less time on storage.

(B) Virtually any type of file can be compressed, but it's important to follow best practices when choosing which ones to compress. For example, some files may already come compressed, so compressing those files would not have a significant impact.
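Both points can be tried out with Python's standard `zlib` module. The sketch below is only an illustration under assumptions: the sample text is synthetic (random words from a small hypothetical vocabulary), and real files will compress by different amounts. It compresses the text once, then compresses the already-compressed output again to show that the second pass gains little or nothing.

```python
import random
import zlib


def compression_ratio(original_size, compressed_size):
    """Uncompressed size divided by compressed size."""
    return original_size / compressed_size


def make_sample_text(word_count, seed=0):
    """Synthetic, redundant text built from a small hypothetical vocabulary."""
    words = [b"data", b"reduction", b"storage", b"backup",
             b"archive", b"block", b"file", b"copy"]
    rng = random.Random(seed)
    return b" ".join(rng.choice(words) for _ in range(word_count))


text = make_sample_text(20000)
once = zlib.compress(text)   # first pass: removes most redundancy
twice = zlib.compress(once)  # second pass: input is already compressed

print("original bytes:        ", len(text))
print("compressed once:       ", len(once))
print("compressed twice:      ", len(twice))
print("ratio after first pass:", round(compression_ratio(len(text), len(once)), 2))
```

Comparing the second and third printed sizes shows why already-compressed files are poor candidates for further compression.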

EVALUATION

(A) When evaluating data compression algorithms, speed is always expressed in terms of uncompressed data handled per second.

(B) An algorithm that can take a 2 MB compressed file and decompress it to a 10 MB file has a compression ratio of 10/2 = 5, sometimes written 5:1 (pronounced "five to one").

(C) For streaming audio and video, the compression ratio is defined in terms of uncompressed and compressed bit rates instead of data sizes:

For example, songs on a CD are uncompressed, with a data rate of 16 bits/sample/channel x 2 channels x 44.1 kSamples/s = 1.4 Mbit/s. The same song encoded as a lossy "high quality" 128 kbit/s Vorbis stream (or a 128 kbit/s MP3 stream or a 128 kbit/s AAC file) yields a compression ratio of about 11:1 ("eleven to one").
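The arithmetic behind both ratios can be checked with a short calculation. The function and variable names below are illustrative only; the figures are the ones quoted above.

```python
def compression_ratio(uncompressed, compressed):
    """Ratio of uncompressed to compressed size (or bit rate)."""
    return uncompressed / compressed


# File-size example: 10 MB uncompressed, 2 MB compressed.
file_ratio = compression_ratio(10, 2)

# Streaming example: CD audio versus a 128 kbit/s lossy stream.
bits_per_sample = 16
channels = 2
samples_per_second = 44_100
cd_bit_rate = bits_per_sample * channels * samples_per_second  # bits per second
lossy_bit_rate = 128_000                                       # bits per second
audio_ratio = compression_ratio(cd_bit_rate, lossy_bit_rate)

print(f"file ratio: {file_ratio:.0f}:1")
print(f"CD bit rate: {cd_bit_rate / 1e6:.2f} Mbit/s")
print(f"audio ratio: about {audio_ratio:.0f}:1")
```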

