Question

Create a Word document that outlines how your ETL process works. This document should be written as a technical support document that a new hire can read to understand what the process does and that technicians might use to troubleshoot the process. Include screen shots,

Solutions

Expert Solution

What is ETL?

ETL is an abbreviation of Extract, Transform and Load. In this process, an ETL tool extracts data from different RDBMS source systems, transforms it (for example, by applying calculations or concatenations), and then loads it into the data warehouse system.

It's tempting to think that creating a data warehouse is simply a matter of extracting data from multiple sources and loading it into the warehouse database. This is far from the truth: it requires a complex ETL process. The ETL process is technically challenging and requires active input from various stakeholders, including developers, analysts, testers, and top executives.

To maintain its value as a tool for decision-makers, a data warehouse system needs to change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system and needs to be agile, automated, and well documented.

Why do you need ETL?

There are many reasons for adopting ETL in the organization:

  • It helps companies analyze their business data in order to make critical business decisions.
  • Transactional databases cannot answer the complex business questions that a data warehouse populated by ETL can answer.
  • A data warehouse provides a common data repository.
  • ETL provides a method of moving data from various sources into a data warehouse.
  • As data sources change, recurring ETL runs bring the data warehouse up to date.
  • A well-designed and documented ETL system is almost essential to the success of a data warehouse project.
  • ETL allows verification of data transformation, aggregation, and calculation rules.
  • The ETL process allows sample data comparison between the source and the target system.
  • The ETL process can perform complex transformations, and it requires an extra (staging) area to store the data.
  • ETL helps to migrate data into a data warehouse, converting the various formats and types so that they adhere to one consistent system.
  • ETL is a predefined process for accessing source data and manipulating it into the target database.
  • ETL offers deep historical context for the business.
  • It helps to improve productivity because it codifies processes so that they can be reused without a need for technical skills.

ETL Process in Data Warehouses

ETL is a three-step process:

Step 1) Extraction

In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the data warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the data warehouse.

A data warehouse needs to integrate systems that have different DBMSs, hardware, operating systems, and communication protocols. Sources could include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, amongst others.

Hence, one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between source and target data.
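One way to picture a logical data map is as a small table of source-to-target rules. The sketch below is illustrative only: the table names, column names, and rules are hypothetical and do not come from any particular system.

```python
# Hypothetical logical data map: each entry links one source column
# to the target column it feeds and names the rule applied on the way.
DATA_MAP = [
    {"source": "crm.customers.first_name", "target": "dim_customer.full_name",
     "rule": "concatenate with last_name"},
    {"source": "crm.customers.last_name", "target": "dim_customer.full_name",
     "rule": "concatenate with first_name"},
    {"source": "pos.sales.amount", "target": "fact_sales.revenue",
     "rule": "sum per customer per day"},
]


def sources_for(target_table):
    """List the source columns that feed any column of target_table."""
    prefix = target_table + "."
    return [m["source"] for m in DATA_MAP if m["target"].startswith(prefix)]


customer_sources = sources_for("dim_customer")
```

A technician troubleshooting a bad value in a target table could use a lookup like `sources_for` to find which source columns to inspect first.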

Three Data Extraction methods:

  1. Full Extraction
  2. Partial Extraction, without update notification
  3. Partial Extraction, with update notification
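The difference between full and partial extraction can be sketched with an in-memory SQLite database standing in for the source system. This is a minimal sketch under assumptions: the `customers` table, its `updated_at` column, and the idea of comparing against the time of the last run are all hypothetical, and represent one common way to do a partial extraction when the source sends no update notification.

```python
import sqlite3


def full_extract(conn):
    """Full extraction: read every row from the source table."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers ORDER BY id"
    ).fetchall()


def partial_extract(conn, last_run):
    """Partial extraction: read only rows changed since the last run."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (last_run,),
    ).fetchall()


# Hypothetical source system with three customer rows.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "2024-01-01"), (2, "Bob", "2024-01-05"), (3, "Cy", "2024-01-09")],
)

all_rows = full_extract(source)
changed_rows = partial_extract(source, "2024-01-04")
```

Here `all_rows` holds all three customers, while `changed_rows` holds only the two modified after the assumed last run date. The dates are stored as ISO strings so that plain string comparison orders them correctly.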

Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases. Any slowdown or locking could affect the company's bottom line.

Some validations are done during Extraction:

  • Reconcile records with the source data
  • Make sure that no spam/unwanted data is loaded
  • Data type check
  • Remove all types of duplicate/fragmented data
  • Check whether all the keys are in place or not
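Several of these checks can be expressed as a small validation pass over the staged rows. The sketch below is only one possible shape: the `id` and `amount` field names and the rule that `amount` must be numeric are assumptions made for illustration.

```python
def validate_extract(rows, expected_count, key="id"):
    """Return (clean_rows, problems) for a list of staged row dicts."""
    problems = []
    # Reconcile the staged row count with the count reported by the source.
    if len(rows) != expected_count:
        problems.append(
            f"count mismatch: source={expected_count}, staged={len(rows)}"
        )
    seen = set()
    clean = []
    for row in rows:
        row_key = row.get(key)
        # Key check: every row must carry its key.
        if row_key is None:
            problems.append(f"missing key: {row}")
            continue
        # Data type check: amount is assumed to be numeric.
        if not isinstance(row.get("amount"), (int, float)):
            problems.append(f"bad amount type for {key}={row_key}")
            continue
        # Duplicate check: keep only the first row seen for each key.
        if row_key in seen:
            problems.append(f"duplicate key: {row_key}")
            continue
        seen.add(row_key)
        clean.append(row)
    return clean, problems


staged = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5},
    {"id": 2, "amount": "n/a"},
    {"id": 3, "amount": 7},
]
clean_rows, problems = validate_extract(staged, expected_count=5)
```

Collecting the problems in a list, rather than stopping at the first one, gives a technician a complete picture of what went wrong in a single run.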

Step 2) Transformation

Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped, and transformed. In fact, this is the key step where the ETL process adds value and changes data so that insightful BI reports can be generated.

In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.

In the transformation step, you can perform customized operations on data. For instance, the user may want a sum-of-sales revenue figure that is not in the database, or the first name and the last name in a table may be in different columns, in which case it is possible to concatenate them before loading.
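Both examples, concatenating name columns and deriving a sum of sales, take only a few lines each. This is a minimal sketch; the field names (`first_name`, `last_name`, `customer_id`, `amount`) are hypothetical.

```python
from collections import defaultdict


def add_full_name(row):
    """Concatenate first and last name into a new full_name field."""
    out = dict(row)  # copy so the staged row is left untouched
    out["full_name"] = f'{row["first_name"]} {row["last_name"]}'.strip()
    return out


def sales_by_customer(sales):
    """Derive a sum-of-sales figure that the source does not store."""
    totals = defaultdict(float)
    for sale in sales:
        totals[sale["customer_id"]] += sale["amount"]
    return dict(totals)


customer = add_full_name({"first_name": "Ada", "last_name": "Lovelace"})
totals = sales_by_customer([
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 5.5},
    {"customer_id": 2, "amount": 3.0},
])
```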

The following are common data integrity problems:

  1. Different spellings of the same person's name, like Jon and John.
  2. Multiple ways to denote a company name, like Google and Google Inc.
  3. Different spellings of the same place name, like Cleaveland and Cleveland.
  4. Different account numbers generated by various applications for the same customer.
  5. Required fields that remain blank in some data.
  6. Invalid product data collected at the point of sale (POS), because manual entry can lead to mistakes.
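Spelling variants like these are often handled with a lookup table that maps each known variant to one standard value. The sketch below assumes small hand-written lookup tables; in practice such tables would be maintained separately and would be much larger.

```python
# Hypothetical lookup tables: keys are lower-cased variants,
# values are the standard spelling to store in the warehouse.
CITY_LOOKUP = {"cleaveland": "Cleveland", "cleveland": "Cleveland"}
COMPANY_LOOKUP = {"google": "Google", "google inc": "Google", "google inc.": "Google"}


def standardize(value, lookup):
    """Map a known variant to its standard form; pass unknown values through."""
    key = value.strip().lower()
    return lookup.get(key, value.strip())


city = standardize(" Cleaveland ", CITY_LOOKUP)
company = standardize("Google Inc.", COMPANY_LOOKUP)
```

Values with no entry in the lookup table are passed through unchanged (apart from trimming whitespace), so an unknown city is not silently rewritten.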

Validations are done during this stage:

  • Filtering: select only certain columns to load.
  • Using rules and lookup tables for data standardization.
  • Character set conversion and encoding handling.
  • Conversion of units of measurement, such as date-time conversions, currency conversions, numerical conversions, etc.
  • Data threshold validation checks. For example, age cannot be more than two digits.
  • Data flow validation from the staging area to the intermediate tables.
  • Required fields should not be left blank.
  • Cleaning (for example, mapping NULL to 0, or Gender Male to "M" and Female to "F").
  • Splitting a column into multiple columns and merging multiple columns into a single column.
  • Transposing rows and columns.
  • Using lookups to merge data.
  • Using any complex data validation (e.g., if the first two columns in a row are empty, the row is automatically rejected from processing).
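A few of the rules above (required fields, the two-digit age threshold, NULL-to-0 and gender cleaning, and rejecting rows whose first two columns are empty) can be combined into a single row-level function. This is a sketch under assumptions: the column names are hypothetical, and `customer_id` and `name` are taken to be the first two columns.

```python
GENDER_CODES = {"male": "M", "female": "F"}  # cleaning rule from the list above


def transform_row(row):
    """Return a cleaned copy of row, or None when the row is rejected."""
    # Complex rule: reject the row when its first two columns are both empty.
    if not row.get("customer_id") and not row.get("name"):
        return None
    # Required-field rule: customer_id must be present.
    if not row.get("customer_id"):
        return None
    # Threshold rule: age, when given, must fit in two digits.
    age = row.get("age")
    if age is not None and not (0 <= age <= 99):
        return None
    gender = str(row.get("gender") or "").strip().lower()
    return {
        "customer_id": row["customer_id"],
        "name": row.get("name"),
        # Cleaning rule: Male -> "M", Female -> "F"; other values pass through.
        "gender": GENDER_CODES.get(gender, row.get("gender")),
        "age": age,
        # Cleaning rule: map NULL (None) to 0.
        "balance": 0 if row.get("balance") is None else row["balance"],
    }


cleaned = transform_row(
    {"customer_id": 7, "name": "Ann", "gender": "Female", "age": 41, "balance": None}
)
```

Returning `None` for rejected rows keeps the sketch short; a production process would more likely write rejected rows to a separate table with the reason for rejection.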

Step 3) Loading

Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period (usually overnight). Hence, the load process should be optimized for performance.

In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data warehouse admins need to monitor, resume, and cancel loads according to prevailing server performance.
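One way to restart from the point of failure is to load in batches, commit each batch in its own transaction, and record the last committed batch in a checkpoint table. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse; the table names and the `fail_on` parameter (which simulates a failure) are hypothetical.

```python
import sqlite3


def load_batches(conn, batches, fail_on=None):
    """Load batches in order, resuming after the last checkpointed batch.

    Returns the index of the batch that failed, or None on success.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS load_checkpoint (last_batch INTEGER)")
    conn.commit()
    last = conn.execute("SELECT MAX(last_batch) FROM load_checkpoint").fetchone()[0]
    start = 0 if last is None else last + 1
    for i in range(start, len(batches)):
        try:
            # The connection context manager commits on success and
            # rolls back the whole batch if an exception is raised.
            with conn:
                conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", batches[i])
                if i == fail_on:
                    raise RuntimeError("simulated load failure")
                conn.execute("INSERT INTO load_checkpoint VALUES (?)", (i,))
        except RuntimeError:
            return i
    return None


batches = [[(1, 10.0), (2, 20.0)], [(3, 30.0)], [(4, 40.0)]]
warehouse = sqlite3.connect(":memory:")
failed_at = load_batches(warehouse, batches, fail_on=1)  # second batch fails
resumed = load_batches(warehouse, batches)               # restart resumes there
row_count = warehouse.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Because the data rows and the checkpoint row are written in the same transaction, a failed batch leaves neither behind, so the restart neither skips rows nor loads them twice.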

Types of Loading:

  • Initial Load: populating all the data warehouse tables.
  • Incremental Load: applying ongoing changes periodically, as and when needed.
  • Full Refresh: erasing the contents of one or more tables and reloading them with fresh data.
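The three load types differ mainly in what happens to rows that are already in the target table. The sketch below shows each against an in-memory SQLite database; the `dim_product` table is hypothetical, and `INSERT OR REPLACE` is just one SQLite-specific way to apply changes, not the only approach.

```python
import sqlite3


def initial_load(conn, rows):
    """Initial load: create the table and populate it for the first time."""
    conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)
    conn.commit()


def incremental_load(conn, changed_rows):
    """Incremental load: insert new rows and overwrite changed ones."""
    conn.executemany("INSERT OR REPLACE INTO dim_product VALUES (?, ?)", changed_rows)
    conn.commit()


def full_refresh(conn, rows):
    """Full refresh: erase the table contents and reload in one transaction."""
    with conn:
        conn.execute("DELETE FROM dim_product")
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)


def snapshot(conn):
    return conn.execute("SELECT id, name FROM dim_product ORDER BY id").fetchall()


warehouse = sqlite3.connect(":memory:")
initial_load(warehouse, [(1, "Pen"), (2, "Ink")])
incremental_load(warehouse, [(2, "Blue Ink"), (3, "Pad")])
after_incremental = snapshot(warehouse)
full_refresh(warehouse, [(10, "Desk")])
after_refresh = snapshot(warehouse)
```

After the incremental load, product 1 is untouched, product 2 is updated, and product 3 is new. After the full refresh, only the freshly supplied rows remain.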

Load verification

  • Ensure that the key field data is neither missing nor null.
  • Test modeling views based on the target tables.
  • Check combined values and calculated measures.
  • Run data checks on the dimension tables as well as the history tables.
  • Check the BI reports built on the loaded fact and dimension tables.
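The first and third checks lend themselves to simple queries against the target. The following is a sketch only: the `fact_sales` table, its columns, and the expected values passed in (which would normally come from the source system) are assumptions.

```python
import sqlite3


def verify_load(conn, expected_rows, expected_total):
    """Run basic post-load checks on the hypothetical fact_sales table."""
    # Key check: no row may have a null key.
    null_keys = conn.execute(
        "SELECT COUNT(*) FROM fact_sales WHERE customer_id IS NULL"
    ).fetchone()[0]
    # Measure check: row count and total should match the source figures.
    rows, total = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM fact_sales"
    ).fetchone()
    return {
        "keys_present": null_keys == 0,
        "row_count_matches": rows == expected_rows,
        "total_matches": abs(total - expected_total) < 0.005,
    }


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [(1, 10.0), (None, 5.5), (2, 3.0)],
)
report = verify_load(warehouse, expected_rows=3, expected_total=18.5)
```

In this sample the totals reconcile, but the report flags `keys_present` as false because one loaded row has a null `customer_id`.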

ETL tools

There are many data warehousing tools available in the market. Here are some of the most prominent ones:

1. MarkLogic:

MarkLogic is a data warehousing solution which makes data integration easier and faster using an array of enterprise features. It can query different types of data like documents, relationships, and metadata.

http://developer.marklogic.com/products

2. Oracle:

Oracle is a widely used database vendor. It offers a wide range of data warehouse solutions, both on-premises and in the cloud. It helps to optimize customer experiences by increasing operational efficiency.

https://www.oracle.com/index.html

3. Amazon Redshift:

Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool for analyzing all types of data using standard SQL and existing BI tools. It also allows running complex queries against petabytes of structured data.

https://aws.amazon.com/redshift/?nc2=h_m1

Best practices for the ETL process

Never try to cleanse all the data:

Every organization would like to have all its data clean, but most are not ready to pay for it or to wait for it. Cleaning it all would simply take too long, so it is better not to try to cleanse all the data.

Always cleanse something:

Always plan to clean something because the biggest reason for building the Data Warehouse is to offer cleaner and more reliable data.

Determine the cost of cleansing the data:

Before cleansing all the dirty data, it is important for you to determine the cleansing cost for every dirty data element.

To speed up query processing, have auxiliary views and indexes:

To reduce storage costs, store summarized data on disk or tape. There is also a trade-off between the volume of data stored and how detailed its usage needs to be. Choose the level of granularity of the data with this trade-off in mind in order to decrease storage costs.
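As a rough illustration of auxiliary views and indexes, the sketch below builds both on a small in-memory SQLite table. The table, index, and view names are hypothetical. Note also that SQLite views are not materialized, so here the view only illustrates the coarser, summarized shape of the data; some warehouse platforms offer materialized views that also store the result.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [(1, 10.0), (1, 5.5), (2, 3.0)],
)

# Auxiliary index: speeds up queries that filter on customer_id.
warehouse.execute("CREATE INDEX idx_sales_customer ON fact_sales (customer_id)")

# Auxiliary view: exposes the data at a coarser granularity (one row per customer).
warehouse.execute(
    "CREATE VIEW v_sales_by_customer AS "
    "SELECT customer_id, SUM(amount) AS total FROM fact_sales GROUP BY customer_id"
)

summary = warehouse.execute(
    "SELECT customer_id, total FROM v_sales_by_customer ORDER BY customer_id"
).fetchall()

# Ask SQLite how it would run a filtered query; the last column of each
# plan row is a text description that names any index being used.
plan = warehouse.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM fact_sales WHERE customer_id = ?", (1,)
).fetchall()
uses_index = any("idx_sales_customer" in row[-1] for row in plan)
```

Checking the query plan, as in the last step, is a quick way for a technician to confirm that a slow report query is actually using the index intended for it.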

Summary:

  • ETL is an abbreviation of Extract, Transform and Load.
  • ETL provides a method of moving the data from various sources into a data warehouse.
  • In the first step, extraction, data is extracted from the source system into the staging area.
  • In the transformation step, the data extracted from source is cleansed and transformed.
  • Loading data into the target data warehouse is the last step of the ETL process.

