In: Computer Science
why is it so challenging to capture data changes as part of an extract, transform, and load (ETL) process? WHAT ARE SOME METHODS THAT CAN BE USED TO CAPTURE INCREMENTAL DATA CHANGES And are there any drawbacks to these methods?
(sorry for the caps)
=> The more data there is, the more complicated the replication becomes, because new data is constantly being added, and existing data is constantly changing.
Methods To Capture Incremental Data Changes
There are a few different methods ::
1. Triggers
2. DATE_MODIFIED
3. Log-Based Change Data Capture
Triggers
=> Database triggers can be used to detect data changes.
=> Firing the trigger, and storing the row changes in a shadow table, introduces overhead.
=> The lower-overhead alternative to only store the primary key of the table requires a join back to the source table to retrieve the changes which
- increases the load to retrieve the changes
- loses intermediate changes if multiple changes took place on the same row
=> Also, if changes are made to tables then triggers and shadow tables may also have to be modified, recreated and/or recompiled which introduces extra overhead to manage and maintain the database.
DATE_MODIFIED
=> Filter should run on the DATE_MODIFIED column to only retrieve rows that were modified since the most recent time data was extracted.
=> To apply this method DATE_MODIFIED must be available on all tables and must be reliably set.
=> Database triggers may be a good way to set the values but these may introduce overhead on the transactional application.
=> DATE_MODIFIED may be indexed to lower the impact of the select statement at the cost of storing and continuously updating the additional index.
Log-Based Change Data Capture
=> The biggest benefit of log-based change data capture is the asynchronous nature. Changes are captured independent of the source application performing the changes.
=> Log-based Change Data Capture is generally considered the superior approach to change data capture that can be applied to all possible scenarios including systems with extremely high transaction volumes.
Disadvantages::
=> Interpreting the changes in the transaction log is difficult because there are no documented standards on how the changes are stored.
=> Database vendors may not provide an interface to the transaction logs – documented or not – and even if there is one it may be relatively slow and/or resource intensive.