[Note: This is the first of two articles by Kehinde O Eseyin, an OCP DBA working with the Nigerian Ports Authority, Delta Ports, Warri, Nigeria. Kehinde can be contacted at pen4kenny@yahoo.ca]
INTRODUCTION
Data change is inevitable in a data warehouse. A data warehouse is a central repository containing enterprise-wide information used for strategic decision-making, and the importance of that information for decision support and operational business processes cannot be over-emphasized. Since the decisions to be taken depend on the information currently available in the warehouse, data currency, correctness and accuracy must be ensured. Data currency, in turn, depends on the ETL process (how often data extraction occurs and the latency, or delay, of the load). To help guarantee data currency, Oracle provides several built-in features targeted at this requirement.
Features such as SQL procedures, materialized views, external tables and the MERGE command are useful in capturing change. This article therefore focuses on the Oracle features that support the capture of changed data, particularly the Oracle Change Data Capture framework, an Oracle 9i Release 1 feature, and Oracle Streams, an Oracle 9i Release 2 feature. It then briefly explores the major enhancements in Oracle 10g that make capturing change easier, and concludes by pointing out issues that should be addressed in future releases of Oracle.
WHAT IS AN ETL PROCESS?
ETL, an acronym for extraction, transformation and loading, is a fundamental process in a data warehouse. Whenever DML (data manipulation language) operations such as INSERT, UPDATE or DELETE are issued on the source database, data extraction occurs.
After data extraction and transformation have taken place, the data are loaded into the data warehouse. Incremental loading is beneficial in the sense that only the data that have changed since the last extraction and transformation are loaded.
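As a quick illustration of incremental loading, the MERGE command mentioned above can apply only new or changed rows from a staging table to a warehouse table in a single statement. The table and column names below (SALES_STG, SALES_DW) are hypothetical and used purely for illustration.

-- Apply only new or changed rows from the staging table to the warehouse table.
MERGE INTO sales_dw d
USING sales_stg s
ON (d.sale_id = s.sale_id)
WHEN MATCHED THEN
  UPDATE SET d.quantity  = s.quantity,
             d.amount    = s.amount,
             d.load_date = SYSDATE
WHEN NOT MATCHED THEN
  INSERT (sale_id, quantity, amount, load_date)
  VALUES (s.sale_id, s.quantity, s.amount, SYSDATE);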
ORACLE CHANGE DATA CAPTURE FRAMEWORK
The Change Data Capture framework is designed to capture only INSERT, UPDATE and DELETE operations on the Oracle database; that is to say, it is 'DML sensitive'. Figure 1 below illustrates the flow of information in the Oracle Change Data Capture framework.
Implementing Oracle Change Data Capture is straightforward; the publish-and-subscribe steps sketched below guide you through the whole implementation process.
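The sketch below outlines a typical cycle with the Oracle 9i synchronous CDC packages DBMS_LOGMNR_CDC_PUBLISH and DBMS_LOGMNR_CDC_SUBSCRIBE. The schema, table and column names are assumptions made for illustration, and the exact parameter lists should be checked against the documentation for your release.

-- Publisher: create a change table on the predefined synchronous
-- change set SYNC_SET for the hypothetical table SCOTT.ORDERS.
BEGIN
  DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'cdcpub',
    change_table_name => 'orders_ct',
    change_set_name   => 'SYNC_SET',
    source_schema     => 'scott',
    source_table      => 'orders',
    column_type_list  => 'order_id NUMBER, status VARCHAR2(20)',
    capture_values    => 'both',
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'y',
    target_colmap     => 'y',
    options_string    => NULL);
END;
/

-- Subscriber: open a subscription window over the change table.
DECLARE
  v_handle NUMBER;
BEGIN
  DBMS_LOGMNR_CDC_SUBSCRIBE.GET_SUBSCRIPTION_HANDLE(
    change_set          => 'SYNC_SET',
    description         => 'Changes to orders',
    subscription_handle => v_handle);
  DBMS_LOGMNR_CDC_SUBSCRIBE.SUBSCRIBE(
    subscription_handle => v_handle,
    source_schema       => 'scott',
    source_table        => 'orders',
    column_list         => 'order_id, status');
  DBMS_LOGMNR_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(v_handle);
  DBMS_LOGMNR_CDC_SUBSCRIBE.EXTEND_WINDOW(v_handle);
  -- PREPARE_SUBSCRIBER_VIEW then exposes the window as a view to
  -- SELECT from; PURGE_WINDOW releases the consumed changes.
END;
/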
ORACLE STREAMS
Oracle Streams, a new feature in Oracle 9i Release 2, is an information-sharing technology that allows the propagation and management of data, events and transactions within Oracle databases or between Oracle and non-Oracle databases. Oracle Streams is flexible in the sense that it allows user intervention: users can specify what information goes into the stream, the route it follows, what happens to events in the stream and how the stream terminates. It is used to capture events such as DML (INSERT, UPDATE and DELETE) and DDL (ALTER, DROP, RENAME) operations.
Fundamental to the working of Oracle Streams are three components: the capture process, the staging area and the apply process.
Figure 2. Architecture of Oracle Streams.
The capture process
The capture process is responsible for identifying the data to capture, such as database changes (DDL and DML) and application-generated messages. Capture can be either implicit, in which the server captures DML and DDL events at a source database using Oracle's default rules, or explicit, in which a customized configuration captures data using procedures.
Furthermore, the capture process formats the retrieved changes into events called logical change records (LCRs), which are then placed in a queue (the staging environment) for onward processing. Logical change records are of two types: DDL LCRs and row LCRs. A DDL LCR describes a change made to a database object by an ALTER, RENAME, CREATE or DROP command. A row LCR, on the other hand, describes the modification of a single row of a table by a DML statement. This implies that updating 10 rows in a table with a single DML statement generates 10 row LCRs.
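A minimal sketch of configuring an implicit capture process with the DBMS_STREAMS_ADM package is shown below. The table, queue, Streams administrator schema (strmadmin) and global database name are assumptions made for illustration; ADD_TABLE_RULES creates the capture process if it does not already exist.

BEGIN
  -- Capture DML and DDL changes to SCOTT.ORDERS from the redo log
  -- and enqueue them as LCRs into the staging queue.
  DBMS_STREAMS_ADM.ADD_TABLE_RULES(
    table_name      => 'scott.orders',
    streams_type    => 'capture',
    streams_name    => 'capture_orders',
    queue_name      => 'strmadmin.streams_queue',
    include_dml     => TRUE,
    include_ddl     => TRUE,
    source_database => 'SRC.EXAMPLE.COM');
END;
/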
The staging process
The staging area is a queue that acts as a temporary repository for logical change records (LCRs) until they are subscribed to. The subscriber (a user application, another staging area or the default apply process) has control over the contents of the staging area; the subscriber can therefore decide which records are propagated or consumed from the queue. For events to be propagated from a queue, a user must own the queue and have the appropriate privileges, not only on the source queue but also on the target queue. Moreover, a particular target queue can accept events from more than one source queue.
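As a sketch, the staging queue and a propagation from it to a queue on a destination database might be configured as follows. The queue names, the strmadmin owner and the DWH.EXAMPLE.COM database link are assumptions made for illustration.

BEGIN
  -- Create an ANYDATA queue to stage the logical change records.
  DBMS_STREAMS_ADM.SET_UP_QUEUE(
    queue_table => 'strmadmin.streams_queue_table',
    queue_name  => 'strmadmin.streams_queue',
    queue_user  => 'strmadmin');

  -- Propagate events for SCOTT.ORDERS from the local staging queue
  -- to a queue at the destination database over a database link.
  DBMS_STREAMS_ADM.ADD_TABLE_PROPAGATION_RULES(
    table_name             => 'scott.orders',
    streams_name           => 'prop_orders',
    source_queue_name      => 'strmadmin.streams_queue',
    destination_queue_name => 'strmadmin.streams_queue@DWH.EXAMPLE.COM',
    include_dml            => TRUE,
    include_ddl            => TRUE,
    source_database        => 'SRC.EXAMPLE.COM');
END;
/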
The apply process
The apply process is responsible for applying the changes to the target database. This can happen in two ways: default (implicit) consumption and customized (explicit) consumption. In default consumption, the apply engine applies the changes to the database directly; if a conflict occurs, the apply engine resolves it by invoking conflict resolution routines.
In customized consumption, the logical change records are passed as arguments to user-defined procedures for processing. Customized procedures that process row LCRs, DDL LCRs and enqueued messages are called DML handlers, DDL handlers and message handlers respectively.
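The sketch below shows what a customized DML handler might look like: a PL/SQL procedure that receives each row LCR as an ANYDATA payload, optionally inspects or transforms it, and then applies it, registered with DBMS_APPLY_ADM.SET_DML_HANDLER. The procedure and object names are illustrative only.

CREATE OR REPLACE PROCEDURE orders_dml_handler (in_any IN ANYDATA) IS
  lcr SYS.LCR$_ROW_RECORD;
  rc  PLS_INTEGER;
BEGIN
  -- Extract the row LCR from the ANYDATA wrapper.
  rc := in_any.GETOBJECT(lcr);
  -- Custom logic could examine lcr.GET_COMMAND_TYPE, audit the change,
  -- or modify column values here before applying it.
  lcr.EXECUTE(TRUE);
END;
/

BEGIN
  -- Register the handler for UPDATE operations on SCOTT.ORDERS.
  DBMS_APPLY_ADM.SET_DML_HANDLER(
    object_name         => 'scott.orders',
    object_type         => 'TABLE',
    operation_name      => 'UPDATE',
    error_handler       => FALSE,
    user_procedure      => 'strmadmin.orders_dml_handler',
    apply_database_link => NULL);
END;
/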
ORACLE 10G AND DATA CHANGE CAPTURE
As Oracle data warehouses become more complex by the day, powerful features that simplify the ETL process have been added to the latest release, Oracle 10g. Much has been done, in particular, to facilitate the extraction and movement of large volumes of data between Oracle databases. The emergence of heterogeneous transportable tablespaces and Oracle Data Pump are testimonies to this fact, and both are described later in this paper.
From the data-structure perspective, change data capture can occur either at the data level or at the application level. At the data level, a table in the target database is regarded as a remote snapshot of a table in the source database. At whichever level capturing and propagation take place, there is always some increase in the workload on the source database. With Oracle 10g, however, this additional overhead largely disappears: asynchronous CDC extracts change data from the redo logs without any negative performance impact on the source database. Asynchronous CDC can thus be described as a lightweight technology for change extraction and propagation in a data warehousing system, in which changes to the source tables are exposed as relational data for onward consumption by subscribers.
There is therefore no gainsaying that asynchronous CDC has greatly enhanced parallel log file processing and data transformation.
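As an illustration, an asynchronous (HotLog) change set in Oracle 10g might be created as follows with DBMS_CDC_PUBLISH, the 10g name of the publish package; change tables are then added to the set with CREATE_CHANGE_TABLE much as in the earlier example. The set name is an assumption, and the parameters should be verified against the 10g documentation.

BEGIN
  -- Create an asynchronous change set fed from the local online redo logs.
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
    change_set_name    => 'orders_async_set',
    description        => 'Asynchronous change set for orders',
    change_source_name => 'HOTLOG_SOURCE',
    stop_on_ddl        => 'y');

  -- Enable capture into the change set once its change tables exist.
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'orders_async_set',
    enable_capture  => 'y');
END;
/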
Heterogeneous Transportable Tablespaces
Transportable tablespaces were introduced in Oracle 8i. Moving data with transportable tablespaces is much faster than exporting and importing the same data, because transporting a tablespace involves only copying its datafiles and integrating the tablespace's schema information into the target data dictionary.
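A minimal sketch of the classic transport flow, assuming a tablespace named SALES_TS and the original export/import utilities, is given below; the exact quoting of the SYSDBA connect string depends on your operating system shell.

-- On the source database: make the tablespace read-only.
ALTER TABLESPACE sales_ts READ ONLY;

-- From the operating system prompt, export only the tablespace metadata:
--   exp "sys/password as sysdba" transport_tablespace=y
--       tablespaces=sales_ts file=sales_ts.dmp

-- Copy the datafiles of SALES_TS and sales_ts.dmp to the target host,
-- then plug the tablespace in:
--   imp "sys/password as sysdba" transport_tablespace=y
--       file=sales_ts.dmp datafiles='/u01/oradata/dwh/sales_ts01.dbf'

-- On the target database, reopen the tablespace for writing if required.
ALTER TABLESPACE sales_ts READ WRITE;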
Transportable tablespaces have proved to be useful in diverse ways. As enormous as the benefits are, however, they have suffered some limitations over the years, the most significant being the requirement that the source and target databases run on the same operating system.
With the advent of Oracle 10g, however, the restriction in former releases that the source database and target database run on the same operating system has been removed. Transportable tablespaces are now platform-friendly: you can transport tablespaces between databases on different platforms. When transporting tablespaces across platforms with different byte ordering, the RMAN utility and its CONVERT command are used to convert the datafiles to the target format. The conversion can be performed either on the source before shipping the datafiles or on the destination after the datafiles have been transported.
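A sketch of the cross-platform conversion with RMAN is shown below; the tablespace name, directory paths and platform strings (which must match the names listed in V$TRANSPORTABLE_PLATFORM) are assumptions for illustration.

# On the source, after ALTER TABLESPACE sales_ts READ ONLY, convert the
# datafiles to the byte order of the target platform before shipping them:
RMAN> CONVERT TABLESPACE sales_ts
        TO PLATFORM 'Linux IA (32-bit)'
        FORMAT '/tmp/transport/%U';

# Alternatively, ship the files unconverted and convert on the destination:
RMAN> CONVERT DATAFILE '/tmp/transport/sales_ts01.dbf'
        FROM PLATFORM 'Solaris[tm] OE (64-bit)'
        FORMAT '/u01/oradata/dwh/%N_%f';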
Oracle Data Pump
One of the new features of Oracle 10g is Oracle Data Pump, an exciting server-side infrastructure for fast movement of bulk data and metadata from one Oracle database to another. Oracle Data Pump is highly flexible: it not only allows you to build customized data movement utilities, but also lets you monitor the status of a load and cancel, suspend or resume it. A load can also be restarted after a failure without loss of data integrity.
When unloading data physically, Oracle Data Pump uses the external table mechanism: the external table data pump unload driver writes the data into platform-independent, Oracle-proprietary files. This external table unload mechanism can be used standalone, without the new export/import utilities (expdp/impdp), which are client-side utilities that make application programming interface (API) calls into Oracle Data Pump.
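For instance, the unload mechanism can be exercised directly with the ORACLE_DATAPUMP external table driver, as sketched below; the directory object, file names and source table are assumptions for illustration.

-- Create a directory object pointing at the unload destination.
CREATE DIRECTORY unload_dir AS '/u01/app/oracle/unload';

-- Unload the result of a query into a platform-independent,
-- Oracle-proprietary dump file.
CREATE TABLE sales_unload
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_DATAPUMP
    DEFAULT DIRECTORY unload_dir
    LOCATION ('sales_unload.dmp'))
  AS SELECT * FROM scott.sales;

-- The same dump file can be read elsewhere through another external
-- table, or the client utilities can drive the data pump directly,
-- for example from the operating system prompt:
--   expdp scott/tiger DIRECTORY=unload_dir DUMPFILE=sales.dmp TABLES=sales
--   impdp scott/tiger DIRECTORY=unload_dir DUMPFILE=sales.dmp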
CONCLUSION
This article has given insight into how changed data can be captured and how the ETL process can be managed using Oracle technologies. Oracle has remained committed to making its manageability features more flexible, as shown by the chronological analysis of how successive releases, from Oracle 9i Release 1 to Oracle 9i Release 2 and then the latest release, Oracle 10g, handle changed data. The dynamic nature of ETL process management in Oracle is highly commendable: from the conventional transform-then-load and load-then-transform methodologies to transform-while-loading, and from synchronous CDC to asynchronous CDC.
By and large, it is worth suggesting that Oracle, in its quest to provide better ETL management functionality, should look into the homogeneity requirements associated with its latest release, such as asynchronous change data capture and transportable tablespaces, for which the source and destination databases must both be 10g. It is also worth mentioning that the remaining drawbacks of transportable tablespaces should be addressed in the next release.
Good luck!