Question

In: Computer Science

2. Explain the ETL process in detail and why it is important. Provide examples.

Solutions

Expert Solution

Hey,

Note: If you have any queries related to the answer, please do comment. I would be happy to resolve all your queries.

ETL is a process that extracts data from different source systems, transforms the data (applying calculations, concatenations, and so on), and finally loads the data into the Data Warehouse system. ETL stands for Extract, Transform and Load.
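
As a minimal sketch of the whole pipeline (illustrative only, not a production implementation), assume a flat file sales.csv with hypothetical product, quantity and unit_price columns, and a SQLite database standing in for the Data Warehouse:

```python
# Minimal ETL sketch: extract rows from a CSV source, transform them by
# deriving a revenue column, and load them into a warehouse table.
# The file name, table name and column names are all hypothetical.
import csv
import sqlite3

def extract(path):
    # Extract: read the raw rows from the source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: derive revenue = quantity * unit_price for each row
    for row in rows:
        row["revenue"] = float(row["quantity"]) * float(row["unit_price"])
    return rows

def load(rows, conn):
    # Load: insert the transformed rows into the warehouse table
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (product TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO fact_sales (product, revenue) VALUES (?, ?)",
        [(r["product"], r["revenue"]) for r in rows],
    )
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(transform(extract("sales.csv")), warehouse)
```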

It's tempting to think that creating a Data Warehouse is simply a matter of extracting data from multiple sources and loading it into the Data Warehouse database. This is far from the truth; it requires a complex ETL process. The ETL process demands active input from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging.

In order to maintain its value as a tool for decision-makers, a Data Warehouse system needs to change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a Data Warehouse system and needs to be agile, automated, and well documented.

Step 1) Extraction

In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the Data Warehouse database, rollback would be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data Warehouse.

A Data Warehouse needs to integrate systems that have different DBMSs, hardware, operating systems and communication protocols. Sources could include legacy applications such as mainframes, customized applications, point-of-contact devices such as ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, among others.

Hence, one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between source and target data.
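
For illustration, a logical data map can be captured as a simple source-to-target mapping before any physical extraction code is written; every system, table and column name below is an assumption made up for the example:

```python
# Illustrative logical data map: each target column in the warehouse is traced
# back to its source field(s) and the transformation rule to be applied.
logical_data_map = {
    "dim_customer.customer_name": {
        "source": "crm.contacts.full_name",
        "rule": "trim whitespace, title-case",
    },
    "fact_sales.revenue": {
        "source": "pos.transactions.qty, pos.transactions.unit_price",
        "rule": "qty * unit_price",
    },
    "fact_sales.sale_date": {
        "source": "pos.transactions.ts",
        "rule": "convert epoch seconds to ISO-8601 date",
    },
}
```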

Three data extraction methods (a sketch follows the list):

  1. Full Extraction
  2. Partial Extraction - without update notification
  3. Partial Extraction - with update notification
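
The difference between a full and a partial (incremental) extraction can be sketched as follows; the source table orders and its updated_at timestamp column are assumptions for the example, and the "without update notification" case is simulated by remembering a high-water mark between runs:

```python
# Sketch of full vs. partial extraction from a hypothetical "orders" table.
import sqlite3

def full_extraction(conn):
    # Pull every row on every run
    return conn.execute("SELECT * FROM orders").fetchall()

def partial_extraction(conn, last_run_ts):
    # Without update notification: rely on a timestamp column maintained by
    # the source system and pull only rows changed since the previous run.
    rows = conn.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (last_run_ts,)
    ).fetchall()
    # Remember the new high-water mark for the next run
    new_high_water_mark = conn.execute(
        "SELECT MAX(updated_at) FROM orders"
    ).fetchone()[0]
    return rows, new_high_water_mark
```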

Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases; any slowdown or locking could affect the company's bottom line.

Some validations are done during extraction (see the sketch after this list):

  • Reconcile records with the source data
  • Make sure that no spam/unwanted data is loaded
  • Data type check
  • Remove all types of duplicate/fragmented data
  • Check whether all the keys are in place or not
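
A rough sketch of these extraction-time validations on staged rows (the field names order_id and quantity are hypothetical) might look like this:

```python
# Reject staged rows that have a missing key, a wrong data type, or that
# duplicate a record already seen in the batch; keep the rest.
def validate_staged_rows(rows):
    seen, clean, rejected = set(), [], []
    for row in rows:
        key = row.get("order_id")
        if not key:                                        # key must be in place
            rejected.append((row, "missing key"))
        elif not str(row.get("quantity", "")).isdigit():   # data type check
            rejected.append((row, "quantity is not an integer"))
        elif key in seen:                                  # duplicate record
            rejected.append((row, "duplicate record"))
        else:
            seen.add(key)
            clean.append(row)
    return clean, rejected
```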

Step 2) Transformation

Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes data so that insightful BI reports can be generated.

In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.

In the transformation step, you can perform customized operations on data. For instance, the user may want a sum-of-sales revenue figure that is not in the database, or the first name and the last name in a table may sit in different columns; it is possible to concatenate them before loading.
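
Both of those examples can be sketched in a few lines; the column names (first_name, last_name, product, revenue) are assumptions:

```python
# Concatenate first and last name into a single column, and derive a
# sum-of-sales figure per product that does not exist in the source database.
def transform_customers(customers):
    for c in customers:
        c["full_name"] = f"{c['first_name']} {c['last_name']}"
    return customers

def sum_of_sales(sales_rows):
    totals = {}
    for row in sales_rows:
        totals[row["product"]] = totals.get(row["product"], 0.0) + row["revenue"]
    return totals
```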

The following are common data integrity problems (a standardization sketch follows this list):

  1. Different spellings of the same person's name, like Jon, John, etc.
  2. Multiple ways of denoting a company name, like Google, Google Inc.
  3. Use of different names, like Cleaveland, Cleveland.
  4. Different account numbers generated by various applications for the same customer.
  5. Required fields left blank in some records.
  6. Invalid products collected at the POS, since manual entry can lead to mistakes.
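
Problems like the first three are typically handled with lookup tables that map every known variant to a canonical value; the mappings below are illustrative only:

```python
# Standardize inconsistent spellings, place names and company names
# before loading (all lookup entries and field names are hypothetical).
NAME_LOOKUP = {"Jon": "John"}
CITY_LOOKUP = {"Cleaveland": "Cleveland"}
COMPANY_LOOKUP = {"Google Inc.": "Google", "Google LLC": "Google"}

def standardize(row):
    row["first_name"] = NAME_LOOKUP.get(row["first_name"], row["first_name"])
    row["city"] = CITY_LOOKUP.get(row["city"], row["city"])
    row["company"] = COMPANY_LOOKUP.get(row["company"], row["company"])
    return row
```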

Validations done during this stage (a cleansing sketch follows this list):

  • Filtering – select only certain columns to load
  • Using rules and lookup tables for data standardization
  • Character set conversion and encoding handling
  • Conversion of units of measurement, such as date/time conversion, currency conversion, numerical conversion, etc.
  • Data threshold validation checks. For example, age cannot be more than two digits.
  • Data flow validation from the staging area to the intermediate tables.
  • Required fields should not be left blank.
  • Cleaning (for example, mapping NULL to 0, or Gender "Male" to "M" and "Female" to "F", etc.)
  • Splitting a column into multiple columns and merging multiple columns into a single column.
  • Transposing rows and columns.
  • Using lookups to merge data.
  • Using any complex data validation (e.g., if the first two columns in a row are empty, automatically reject the row from processing).
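
A few of these cleansing and validation rules, sketched with hypothetical field names:

```python
# Map NULL/empty amounts to 0, standardize gender codes, enforce the
# two-digit age threshold, and reject rows whose first two columns are empty.
GENDER_MAP = {"Male": "M", "Female": "F"}

def cleanse(row):
    row["amount"] = row.get("amount") or 0
    row["gender"] = GENDER_MAP.get(row.get("gender"), row.get("gender"))
    return row

def is_valid(row, first_two_columns=("first_name", "last_name")):
    if all(not row.get(col) for col in first_two_columns):
        return False                      # reject the row from processing
    age = int(row.get("age") or 0)
    return 0 < age < 100                  # age cannot be more than two digits
```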

Step 3) Loading

Loading data into the target Data Warehouse database is the last step of the ETL process. In a typical Data Warehouse, a huge volume of data needs to be loaded in a relatively short window (often overnight). Hence, the load process should be optimized for performance.

In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data Warehouse administrators need to monitor, resume, or cancel loads according to prevailing server performance.

Types of loading (a sketch follows the list):

  • Initial Load – populating all the Data Warehouse tables
  • Incremental Load – applying ongoing changes periodically, as and when needed
  • Full Refresh – erasing the contents of one or more tables and reloading them with fresh data
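
Against the SQLite stand-in used earlier, the load types might be sketched like this; the fact_sales table is assumed to have a UNIQUE constraint on product so that the incremental load can upsert:

```python
# Sketch of full refresh vs. incremental load into a hypothetical warehouse table.
import sqlite3

def full_refresh(conn, rows):
    # Erase the table contents and reload with fresh data
    # (the initial load is the same insert against an empty table).
    conn.execute("DELETE FROM fact_sales")
    conn.executemany("INSERT INTO fact_sales (product, revenue) VALUES (?, ?)", rows)
    conn.commit()

def incremental_load(conn, changed_rows):
    # Apply only the ongoing changes: update existing keys, insert new ones.
    # Assumes fact_sales(product) carries a UNIQUE constraint.
    conn.executemany(
        """INSERT INTO fact_sales (product, revenue) VALUES (?, ?)
           ON CONFLICT(product) DO UPDATE SET revenue = excluded.revenue""",
        changed_rows,
    )
    conn.commit()
```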

Load verification (a verification sketch follows the list):

  • Ensure that the key field data is neither missing nor null.
  • Test modeling views based on the target tables.
  • Check combined values and calculated measures.
  • Data checks in the dimension tables as well as the history tables.
  • Check the BI reports on the loaded fact and dimension tables.
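
Two of these checks, sketched as post-load queries against the hypothetical fact_sales table:

```python
# Verify that the key field is never NULL and that the loaded row count
# reconciles with the number of rows that were staged for this batch.
def verify_load(conn, expected_row_count):
    null_keys = conn.execute(
        "SELECT COUNT(*) FROM fact_sales WHERE product IS NULL"
    ).fetchone()[0]
    loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    assert null_keys == 0, "key field contains NULL values"
    assert loaded == expected_row_count, "row count does not reconcile"
```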

Kindly revert for any queries

Thanks.

