Evaluate any two data reduction techniques with examples. What is the format for reporting results of each?
There are three types of data reduction techniques: feature reduction, case reduction and value reduction.
1. Feature reduction reduces the number of features (columns) in the data set, either by selecting the most relevant features or by combining two or more features into a single feature.
Common feature reduction techniques include principal component analysis (PCA), heuristic feature selection with a wrapper method, and feature selection with decision trees.
2. Case reduction reduces the number of cases (rows) in a data set, which is usually achieved through specialized sampling methods or sampling strategies.
Examples of case reduction techniques are incremental samples, average samples, increasing the sampling period, and strategic sampling of key events.
3. Value reduction reduces the number of distinct values a feature can take by grouping values into a single category.
Prominent value reduction techniques are rounding, k-means clustering, and discretization using entropy minimization (see the sketch after this list).
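To make the three types concrete, here is a minimal Python sketch, assuming scikit-learn, pandas and NumPy are available; the synthetic data set and the parameter choices (2 components, a 10% sample, 5 bins) are illustrative assumptions, not prescriptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 6)),
                  columns=[f"f{i}" for i in range(6)])

# 1. Feature reduction: PCA combines the six columns into two
#    principal components (1000 rows x 2 columns).
components = PCA(n_components=2).fit_transform(df)

# 2. Case reduction: keep a 10% random sample of the rows
#    (100 rows x 6 columns).
sample = df.sample(frac=0.10, random_state=0)

# 3. Value reduction: discretize a continuous feature into 5 bins
#    via k-means, so the column takes only 5 distinct values.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
binned = binner.fit_transform(df[["f0"]])

print(components.shape, sample.shape, np.unique(binned).size)
# (1000, 2) (100, 6) 5
```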
The two techniques evaluated here are data deduplication and data compression.
1. The best-known data reduction technique is data deduplication, which eliminates redundant data on storage systems. The deduplication process typically occurs at the storage block level: the system analyzes the storage for duplicate blocks and removes the redundant ones. The single remaining block is shared by every file that requires a copy of it.
Data deduplication -- often called intelligent compression or single-instance storage -- is a process that eliminates redundant copies of data and reduces storage overhead. Data deduplication techniques ensure that only one unique instance of data is retained on storage media, such as disk, flash or tape.
Use the following commands and tools to help you evaluate data deduplication effectiveness:
Action | Explanation |
---|---|
Use the QUERY STGPOOL server command to quickly check deduplication results. | The Duplicate Data Not Stored field shows the actual reduction of data, in megabytes or gigabytes, and the percentage reduction for the storage pool. For example, issue the following command: `query stgpool format=detailed`. If the query is run before reclamation of the storage pool, the Duplicate Data Not Stored value is not accurate because it does not reflect the most recent data reduction. If reclamation has not yet taken place, issue `show deduppending backuppool-file` (where `backuppool-file` is the name of the deduplicated storage pool) to show the amount of data still to be removed. |
Use the QUERY OCCUPANCY server command. | This command shows the logical amount of storage per file space when a file space is backed up to a deduplicated storage pool. |
EXAMPLE
A typical email system might contain 100 instances of the same 1 megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is stored; each subsequent instance is referenced back to the one saved copy. In this example, a 100 MB storage demand drops to 1 MB.
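The arithmetic of this example can be reproduced in a short Python sketch; the SHA-256 fingerprint, 1 MB block size and in-memory dictionary are illustrative assumptions standing in for a real storage system's block store.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024            # treat each 1 MB attachment as one block
attachment = b"x" * BLOCK_SIZE      # the shared 1 MB attachment

store = {}                          # digest -> the one stored block
references = []                     # per-file pointers into the store

for _ in range(100):                # 100 emails carry the same attachment
    digest = hashlib.sha256(attachment).hexdigest()
    if digest not in store:         # store each unique block only once
        store[digest] = attachment
    references.append(digest)

stored_mb = sum(len(b) for b in store.values()) // BLOCK_SIZE
print(f"logical: 100 MB, stored: {stored_mb} MB, references: {len(references)}")
# logical: 100 MB, stored: 1 MB, references: 100
```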
2. Data archiving and data compression can also reduce the amount of data that has to be stored on primary storage systems.
Data compression reduces the size of a file by removing redundant information from files so that less disk space is required. This is accomplished natively in storage systems using algorithms or formulas designed to identify and remove redundant bits of data.
Data compression is a reduction in the number of bits needed to represent data. Compressing data can save storage capacity, speed up file transfer, and decrease costs for storage hardware and network bandwidth.
EXAMPLE
(A) Data compression can dramatically decrease the amount of storage a file takes up.
For example, in a 2:1 compression ratio, a 20 megabyte (MB) file takes up 10 MB of space. As a result of compression, administrators spend less money and less time on storage.
(B) Virtually any type of file can be compressed, but it's important to follow best practices when choosing which ones to compress. For example, some files may already come compressed, so compressing those files would not have a significant impact.
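Both points can be demonstrated with the standard-library zlib compressor; the two inputs below are illustrative assumptions (random bytes stand in for a file that is already compressed).

```python
import os
import zlib

redundant = b"the same line of text repeated many times\n" * 10_000
random_like = os.urandom(len(redundant))   # mimics already-compressed data

for name, data in [("redundant", redundant), ("random-like", random_like)]:
    packed = zlib.compress(data)
    ratio = len(data) / len(packed)        # uncompressed size / compressed size
    print(f"{name}: {len(data)} -> {len(packed)} bytes, about {ratio:.1f}:1")
# the redundant input shrinks dramatically; the random-like input does not
```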
EVALUATION
(A) When evaluating data compression algorithms, speed is conventionally reported in terms of uncompressed data handled per second: input bytes per second when compressing, output bytes per second when decompressing.
(B) An algorithm that can take a 2 MB compressed file and decompress it to a 10 MB file has a compression ratio of 10/2 = 5, sometimes written 5:1 (pronounced "five to one").
(C) For streaming audio and video, the compression ratio is defined in terms of uncompressed and compressed bit rates instead of data sizes:
For example, songs on a CD are uncompressed, with a data rate of 16 bits/sample/channel x 2 channels x 44.1 kSamples/s = 1.4 Mbit/s. The same song encoded as a (lossy, "high quality") 128 kbit/s Vorbis stream (or a 128 kbit/s MP3 or AAC stream) yields a compression ratio of about 11:1 ("eleven to one").
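That bit-rate arithmetic can be checked in a few lines of Python:

```python
# CD audio: 16 bits/sample/channel x 2 channels x 44,100 samples/s
cd_rate = 16 * 2 * 44_100          # = 1,411,200 bit/s, about 1.4 Mbit/s
encoded_rate = 128_000             # 128 kbit/s Vorbis/MP3/AAC stream

print(f"CD rate: {cd_rate / 1e6:.4f} Mbit/s")        # 1.4112 Mbit/s
print(f"ratio:   {cd_rate / encoded_rate:.2f}:1")    # about 11.03:1
```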