Why is original/raw data not readily usable for analytics tasks? What are the main data preprocessing steps? List them and explain their importance in analytics.
Original/raw data is not readily usable for analytics tasks because it is usually dirty: it contains discrepancies and incomplete information. It may also be inconsistently formatted, extremely complex, and inaccurate. Running analytics directly on such data produces inaccurate results and can cause outright errors. The raw data therefore has to be processed into refined data first.
The main data preprocessing steps include:
1. Data consolidation: First the data is collected. Then, from the raw data, we select the data relevant to the analytics task and integrate it into a single data set. This reduces overhead and avoids extra computational steps later.
2. Data cleaning: Here we impute missing values, reduce noise in the dirty data, and eliminate redundant data. This helps us achieve accurate results.
3. Data transformation: We normalize the data so that all attributes are on a comparable scale. We then discretize the data where needed and construct new attributes from the existing ones. Consistently scaled and well-structured data is much easier to analyze.
4. Data reduction: The entire raw data set is not required for analytics, so we reduce the number of dimensions (attributes), reduce the volume (records), and balance skewed data where needed.
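The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a definitive pipeline: the toy records, the field names, the age bins, and the helper functions (consolidate, clean, transform, reduce_columns) are all hypothetical and chosen just to show one simple instance of each step. Balancing skewed data is left out to keep the sketch short.

```python
from statistics import mean

# Hypothetical toy records from two sources (illustrative only).
customers = [
    {"id": 1, "age": 25, "city": "Austin"},
    {"id": 2, "age": None, "city": "Boston"},   # missing value
    {"id": 3, "age": 45, "city": "Austin"},
    {"id": 3, "age": 45, "city": "Austin"},     # duplicate row
]
purchases = {1: 120.0, 2: 80.0, 3: 200.0}       # id -> total spend


def consolidate(customers, purchases):
    """Step 1: select the relevant fields and integrate both sources by id."""
    return [
        {"id": c["id"], "age": c["age"], "spend": purchases[c["id"]]}
        for c in customers
        if c["id"] in purchases
    ]


def clean(rows):
    """Step 2: drop duplicate rows and impute missing ages with the mean age."""
    seen, unique = set(), []
    for r in rows:
        key = (r["id"], r["age"], r["spend"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(r))
    fill = mean(r["age"] for r in unique if r["age"] is not None)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill
    return unique


def transform(rows):
    """Step 3: min-max normalize spend to [0, 1] and discretize age into bins."""
    lo = min(r["spend"] for r in rows)
    hi = max(r["spend"] for r in rows)
    span = (hi - lo) or 1.0  # avoid division by zero if all values are equal
    out = []
    for r in rows:
        if r["age"] < 30:
            group = "young"
        elif r["age"] < 50:
            group = "middle"
        else:
            group = "senior"
        out.append({
            "id": r["id"],
            "age": r["age"],
            "age_group": group,
            "spend_scaled": (r["spend"] - lo) / span,
        })
    return out


def reduce_columns(rows, keep):
    """Step 4: keep only the attributes the analysis actually needs."""
    return [{k: r[k] for k in keep} for r in rows]


prepared = reduce_columns(
    transform(clean(consolidate(customers, purchases))),
    keep=("id", "age_group", "spend_scaled"),
)
for row in prepared:
    print(row)
```

Each function takes a list of records and returns a new list, so the steps can be chained in order or swapped out individually, for example replacing mean imputation with another strategy.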