How to sync data from Oracle to Kafka in real time: Debezium, Oracle Golden Gate, and Tapdata

March 28, 2024
As real-time data processing becomes a key driver of business innovation and growth, many enterprises are choosing to stream Oracle change data capture (CDC) events to Kafka to meet the growing demand for real-time data. This article examines the rationale behind this trend and compares the most common solutions, giving you a clearer basis for technical decision-making.

Why: Understanding the need for streaming real-time Oracle CDC events to Kafka

Let's begin by examining a common business scenario. In real-time inventory management, the ability to promptly respond to inventory changes is crucial for ensuring timely replenishment and order processing. By streaming Oracle CDC events with Kafka, enterprises can achieve real-time capture of inventory changes and efficiently transmit these change events to downstream business components such as order systems and reporting systems. This enables businesses to manage inventory in a flexible and timely manner, thereby improving overall supply chain efficiency. While there are alternative approaches such as publishing events to Kafka from the source applications rather than the database, this requires far more development effort and is more complex for handling large transactions and keeping the messages in sync with the database state.

What is so good about Kafka?

Considering the inherent traits of Oracle and Kafka, we can identify the practical needs behind Oracle to Kafka data synchronization. These needs often involve fulfilling real-time data demands, enabling large-scale data processing, constructing event-driven architectures, and improving overall system performance. Here are some common reasons for syncing to Kafka:
  • Real-time data streaming: Kafka, as a distributed messaging queue system, provides high throughput and low latency for real-time data processing. Streaming Oracle CDC events to Kafka facilitates real-time capture and processing of data, enabling businesses to rapidly respond to data changes.
  • High availability: Kafka achieves high availability through replication across a distributed cluster. A Kafka cluster usually consists of multiple brokers, each storing a share of the partitions along with replicas of others, so even if a broker fails, the remaining brokers keep the service available.
  • Scalability: Kafka's distributed architecture, coupled with features such as consumer groups, partitioning strategy, replication mechanism, horizontal scalability, high concurrency processing capabilities, and fault tolerance, enables it to handle massive message data, meeting demands for high throughput and concurrency.
  • Efficient reads: Leveraging sequential disk I/O, sparse offset indexes on log segments (located via binary search), and the operating system's page cache, Kafka serves massive message streams with high throughput and low latency.
  • High-concurrency writes: Kafka emphasizes high-concurrency data writes and employs various technologies such as zero-copy, batch processing, message compression, and asynchronous processing to enhance data transfer efficiency and processing speed. This capability enables Kafka to handle large volumes of data streams effectively, making it suitable for high-throughput scenarios such as log recording and event sourcing.
  • Decoupling data producers and consumers: Kafka's publish-subscribe model decouples data producers from consumers, so database changes are only loosely coupled to the applications and analytics systems that consume them, increasing system flexibility (see the consumer-group sketch after this list).
  • Support for event-driven architecture: Synchronizing Oracle data to Kafka enables the construction of event-driven architectures. Database changes can be transmitted as event streams, triggering actions in other components of the system and enabling more flexible and agile business processes.
  • Data Integration: As middleware, Kafka can coordinate data flows between different systems, facilitating seamless integration with other data sources and targets. This capability enables systems to better adapt to complex data processing and exchange requirements.
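To make the decoupling and consumer-group points concrete, here is a minimal sketch using Kafka's console consumer. The topic name oracle.INVENTORY.STOCK and the group names order-service and reporting-service are hypothetical; because each consumer group tracks its own offsets, an order service and a reporting system can read the same Oracle change stream independently, without coordinating with each other or touching the source database.

# Each consumer group keeps its own offsets, so both commands receive every change event.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic oracle.INVENTORY.STOCK --group order-service --from-beginning

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic oracle.INVENTORY.STOCK --group reporting-service --from-beginning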

Why real-time data sync matters

When synchronizing data to Kafka, the real-time capability of the pipeline is often what matters most. Here are a few common scenarios as examples:
  • Real-time reporting and monitoring systems: Businesses rely on real-time monitoring and reporting for operational insights. This includes operational and performance monitoring, where access to database changes in real-time is paramount. By synchronizing data to Kafka in real-time, monitoring systems remain current and effective.
  • Event-driven architectures: Modern applications often adopt event-driven architectures, facilitating system integration through a publish-subscribe model. Real-time data synchronization to Kafka is vital for ensuring timely event propagation and processing.
  • Enhanced user experience: Applications requiring real-time interaction and responsiveness demand up-to-date data presentation. For instance, in online collaboration or real-time communication apps, users expect immediate visibility into others' actions and changes.

By now, we have gained a basic understanding of the importance of real-time synchronization of data from Oracle to Kafka. Next, let's explore some common synchronization methods.

How: a comparison of common data sync solutions

Manual method: leveraging open source tools

Implementing real-time Oracle to Kafka data synchronization manually entails several steps, including Oracle database setup, Kafka environment configuration, and synchronization program development. Below is a concise example of a manual method, focusing on utilizing the Debezium open-source tool for real-time Oracle to Kafka synchronization.

Step 1: Preparations

  1. Install Oracle database: Ensure that Oracle database is installed and properly configured.
  2. Install Kafka: Install Kafka and, if your Kafka version relies on it, start the ZooKeeper service (KRaft-mode clusters do not need ZooKeeper).
  3. Install and configure Debezium: Debezium is an open-source change data capture tool used to monitor database changes and send them to Kafka. Download and configure the Debezium Connector for Oracle. (https://debezium.io/)
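To complement step 3, the sketch below shows one way to install the connector plugin, assuming Kafka Connect's plugin.path points at /opt/kafka/plugins and using a placeholder release version; note that Debezium does not ship the Oracle JDBC driver, so ojdbc8.jar usually has to be added manually.

# Placeholder version; substitute the Debezium release you actually use.
curl -LO https://repo1.maven.org/maven2/io/debezium/debezium-connector-oracle/2.5.0.Final/debezium-connector-oracle-2.5.0.Final-plugin.tar.gz
tar -xzf debezium-connector-oracle-2.5.0.Final-plugin.tar.gz -C /opt/kafka/plugins/
# The Oracle JDBC driver is not bundled for licensing reasons; copy it in manually.
cp ojdbc8.jar /opt/kafka/plugins/debezium-connector-oracle/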

Step 2: Configure Oracle database

  1. Enable archive log mode: Debezium reads changes from Oracle's redo and archive logs, so the database must actually be running in ARCHIVELOG mode; switching it on requires restarting the instance (SHUTDOWN IMMEDIATE, STARTUP MOUNT, ALTER DATABASE ARCHIVELOG, ALTER DATABASE OPEN). Then set the archive destination and format:
ALTER SYSTEM SET LOG_ARCHIVE_DEST_1='LOCATION=/archivelog';
ALTER SYSTEM SET LOG_ARCHIVE_FORMAT='arch_%t_%s_%r.arc';

  2. Create CDC user: Create a dedicated user for change data capture (CDC) and grant it the necessary permissions. The grants below are only a starting point; the additional privileges Debezium needs are sketched at the end of this step.

CREATE USER cdc_user IDENTIFIED BY cdc_password;
GRANT CONNECT, RESOURCE, CREATE VIEW TO cdc_user;

  3. Enable supplemental logging: Debezium's Oracle connector relies on LogMiner and supplemental logging rather than the legacy DBMS_CDC_PUBLISH package. Enable supplemental logging at the database level, and at the table level for each table to be captured.

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE CDC_USER.MY_TABLE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
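The cdc_user created in step 2 needs more than the CONNECT/RESOURCE grants above when Debezium's LogMiner adapter is used. Below is a non-exhaustive sketch of the additional grants called out in the Debezium Oracle connector documentation (multitenant databases typically use a common c##-prefixed user instead); check the documentation of your Debezium version for the complete list.

GRANT CREATE SESSION TO cdc_user;
GRANT LOGMINING TO cdc_user;
GRANT SELECT ANY TRANSACTION TO cdc_user;
GRANT FLASHBACK ANY TABLE TO cdc_user;
GRANT SELECT ON V_$DATABASE TO cdc_user;
GRANT EXECUTE ON DBMS_LOGMNR TO cdc_user;
GRANT EXECUTE ON DBMS_LOGMNR_D TO cdc_user;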

Step 3: Configuring the Debezium Connector

  1. Configure the Debezium Connector: Create a JSON configuration file specifying the Oracle connection details, the tables to monitor, and other relevant information. The // comments below are explanatory only and must be removed for the file to be valid JSON. (Note: in Debezium 2.x, database.server.name has been replaced by topic.prefix.)

{
  "name": "oracle-connector",                        // Name under which the connector is registered.
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",    // Oracle connector class.
    "database.server.name": "my-oracle-server",      // Logical name that namespaces the Kafka topics for this Oracle server.
    "database.hostname": "your-oracle-host",         // Oracle instance address.
    "database.port": "your-oracle-port",             // Oracle database port.
    "database.user": "cdc_user",                     // Oracle database user.
    "database.password": "cdc_password",             // Oracle database password.
    "database.dbname": "your-oracle-database",       // Name of the database to capture changes from.
    "database.out.server.name": "oracle-server",     // XStream outbound server name (only needed with the XStream adapter).
    "table.include.list": "CDC_USER.MY_TABLE",       // Oracle tables to monitor.
    "schema.history.internal.kafka.bootstrap.servers": "192.3.65.195:9092",   // Kafka brokers used for the schema history topic.
    "schema.history.internal.kafka.topic": "schema-changes.inventory"         // Topic where the connector stores and recovers DDL statements.
  }
}

  2. Start the Debezium Connector: How you start the connector depends on how Kafka Connect runs. In standalone mode, the connector configuration is supplied as a .properties file on the command line, as shown below; in distributed mode, the JSON file above is registered through the Kafka Connect REST API (see the curl sketch after this command).

bin/connect-standalone.sh config/worker.properties config/debezium-connector-oracle.properties
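A minimal sketch of registering the same configuration with a Kafka Connect worker running in distributed mode, assuming the worker's REST interface listens on localhost:8083 and the JSON above is saved as oracle-connector.json:

curl -s -X POST -H "Content-Type: application/json" \
  --data @oracle-connector.json \
  http://localhost:8083/connectors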

Step 4: Validating synchronization

  1. Insert data: Insert some data into the Oracle database.
INSERT INTO CDC_USER.MY_TABLE (ID, NAME) VALUES (1, 'John Doe');  
  2. Check Kafka topics: Verify that messages reflecting the table changes appear in Kafka.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-oracle-server.CDC_USER.MY_TABLE --from-beginning
The JSON messages related to insertion operations should now be visible.
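For orientation, here is a heavily abbreviated sketch of such a message. Debezium wraps each change in an envelope with before/after row images, an operation code ("c" for an insert), and source metadata; the exact fields depend on the connector version and converter settings:

{
  "before": null,
  "after": { "ID": 1, "NAME": "John Doe" },
  "source": { "connector": "oracle", "schema": "CDC_USER", "table": "MY_TABLE" },
  "op": "c",
  "ts_ms": 1711612800000
}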
The above is only a simple example; real scenarios are usually more complex, and the exact configuration depends on the versions of Oracle and Debezium in use. In production environments, follow the relevant security and operational best practices.

Classic approach: using Oracle's official tool OGG

Implementing real-time Oracle to Kafka data synchronization with Oracle GoldenGate (OGG) also involves multiple steps: classic OGG captures changes on the source, and the Kafka handler in Oracle GoldenGate for Big Data delivers them to Kafka. Below is a basic example. Please note that specific configurations may vary depending on the Oracle GoldenGate version.

Step 1: Preparations

  1. Install Oracle GoldenGate: Install and configure Oracle GoldenGate for the source database, plus Oracle GoldenGate for Big Data (which provides the Kafka handler) on the delivery side.
  2. Install Kafka: Install Kafka and start the ZooKeeper service as a dependency for Kafka.
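Each GoldenGate installation also needs its Manager process running before extracts and replicats can be added. A minimal GGSCI sketch, assuming the default Manager port is configured in the MGR parameter file (a single PORT 7809 line):

cd $OGG_HOME
./ggsci
GGSCI> EDIT PARAMS MGR
GGSCI> START MGR
GGSCI> INFO MGR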

Step 2: Configuring the Oracle database

  1. Enable archivelog mode: Ensure that the Oracle database is running in archivelog mode (as in the Debezium example above) and that the archive destination and format are set.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_1='LOCATION=/archivelog';
ALTER SYSTEM SET LOG_ARCHIVE_FORMAT='arch_%t_%s_%r.arc';
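GoldenGate also requires supplemental logging on the captured tables, which is typically enabled from GGSCI with ADD TRANDATA after logging into the database. A minimal sketch, using the ogguser credentials from later in this example and a placeholder <schema> for the owner of source_table:

./ggsci
GGSCI> DBLOGIN USERID ogguser, PASSWORD oggpassword
GGSCI> ADD TRANDATA <schema>.source_table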

Step 3: Configuring OGG extract and pump

  1. Create the OGG Extract and data pump: Register the primary Extract (ext1), which captures changes from the transaction log, and the data-pump Extract (dpump), which ships the local trail to the remote trail. (Parameter-file sketches for both follow at the end of this step.)
cd $OGG_HOME
./ggsci
GGSCI> ADD EXTRACT ext1, TRANLOG, BEGIN NOW
GGSCI> ADD EXTTRAIL /trail/et, EXTRACT ext1
GGSCI> ADD EXTRACT dpump, EXTTRAILSOURCE /trail/et
GGSCI> ADD RMTTRAIL /trail/rt, EXTRACT dpump

  2. Add the Replicat: On the delivery side (OGG for Big Data), add a Replicat that reads the remote trail written by the data pump.

GGSCI> ADD REPLICAT rep1, EXTTRAIL /trail/rt
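The ext1 and dpump processes registered in step 1 also need parameter files (created with EDIT PARAMS in GGSCI). A minimal sketch, assuming the trail locations used above, placeholder credentials, and a placeholder <schema> that owns the captured table:

-- ext1.prm (primary extract)
EXTRACT ext1
USERID ogguser, PASSWORD oggpassword
EXTTRAIL /trail/et
TABLE <schema>.source_table;

-- dpump.prm (data pump)
EXTRACT dpump
RMTHOST <target-host>, MGRPORT 7809
RMTTRAIL /trail/rt
TABLE <schema>.source_table;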

Step 4: Configuring OGG Replicat and Kafka

  1. Edit the OGG Replicat parameter file: With the Kafka handler in OGG for Big Data, the Replicat does not log into a target database. Instead, it loads the Java adapter and points at a properties file (dirprm/kafka.props) that holds the Kafka delivery settings, sketched at the end of this step.

REPLICAT rep1
TARGETDB LIBFILE libggjava.so SET property=dirprm/kafka.props
REPORTCOUNT EVERY 1 MINUTES, RATE
MAP <schema>.source_table, TARGET <schema>.source_table;

    2. Start OGG Replicat: Start the Replicat process.

./ggsci
GGSCI> START REPLICAT rep1
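The dirprm/kafka.props file referenced by the Replicat, together with a Kafka producer properties file, might look roughly like the sketch below. Property names vary somewhat between OGG for Big Data versions, and the class path and topic mapping here are assumptions for this example (a static topic name, kafka_topic, is used so the validation step below stays consistent):

# dirprm/kafka.props
gg.handlerlist=kafkahandler
gg.handler.kafkahandler.type=kafka
gg.handler.kafkahandler.kafkaProducerConfigFile=custom_kafka_producer.properties
gg.handler.kafkahandler.topicMappingTemplate=kafka_topic
gg.handler.kafkahandler.format=json
gg.classpath=dirprm/:/opt/kafka/libs/*

# dirprm/custom_kafka_producer.properties
bootstrap.servers=localhost:9092
acks=1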

Step 5: Validating the synchronization

  1. Insert data: Insert some data into the Oracle database.
INSERT INTO source_table (ID, NAME) VALUES (1, 'John Doe');
  2. Check Kafka topic: Verify that messages for the table changes appear in Kafka.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic kafka_topic --from-beginning

Next-generation real-time data platform tools: cost-effective and user-friendly

Take Tapdata as an example: it is a modern data platform that prioritizes low-latency data movement. It offers more than 100 built-in data connectors, stable real-time data collection and transmission, sub-second processing latency, and user-friendly, low-code operation. Common use cases include database replication, data warehousing, and ETL processing.

Tapdata is specifically designed for real-time data synchronization, providing robust and stable data pipeline capabilities. It serves as a viable alternative to traditional synchronization tools like OGG/DSG, enabling seamless data synchronization from databases like Oracle and MySQL to various data targets, whether homogeneous or heterogeneous.

Here is a step-by-step tutorial (the demonstration uses Tapdata Cloud):

Step 1: Tapdata installation and deployment

  1. Register and log in to Tapdata Cloud.
  2. Install and deploy Tapdata: Visit the Tapdata official website, follow the instructions, and complete the installation and deployment of Tapdata Agent.

Step 2: Configure data sources and targets

  1. Create a new Oracle data source: Go to the Tapdata Cloud connection management page, create a connection for the Oracle data source, and verify its connectivity.

  2. Create a Kafka data target: Repeat the previous step, locate Kafka in the list of data sources, follow the connection configuration guide to create a connection to Kafka as the data target, and verify its connectivity.

Step 3: Configuring Kafka

  1. Creating Kafka topic: Create a topic in Kafka to receive data synchronized from Oracle.
bin/kafka-topics.sh --create --topic my_oracle_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
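Optionally, before starting the task you can confirm that the topic was created:

bin/kafka-topics.sh --describe --topic my_oracle_topic --bootstrap-server localhost:9092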

Step 4: Initiating synchronization tasks

  1. Creating data sync task: Utilize Tapdata's visual interface to quickly create an Oracle-Kafka data synchronization task by dragging and dropping connections between the data source and target.
  2. Starting sync task: Click on the source and target nodes, select the tables to be synchronized, and launch the task. Tapdata will begin capturing data and changes from the Oracle database and sending them to Kafka.

Step 5: Validating synchronization

  1. Inserting data: Insert some data into the Oracle database.
INSERT INTO my_table (id, name) VALUES (1, 'John Doe');
  2. Check Kafka topic: Use the Kafka command-line tools to verify that the synchronized data has arrived in the Kafka topic.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_oracle_topic --from-beginning

How to choose: comprehensive comparison and selection of solutions

Choosing between data synchronization solutions involves weighing multiple factors. Here's a brief overview of the pros and cons of each type of solution:

① Manual configuration method

Pros:

  1. Customization: Offers full customization to meet specific business needs.
  2. Cost-effective: No additional software licensing fees are required.

Cons:

  1. Complexity: Requires manual handling of all steps, potentially increasing configuration and management complexity.
  2. Maintenance Challenges: For complex synchronization needs, manual configuration may lead to increased maintenance efforts.
  3. Time-consuming: Manual configuration demands more time and technical expertise.

② OGG

Pros:

  1. Maturity and Reliability: OGG, provided by Oracle, is mature and stable after years of development.
  2. Management Tooling: Monitoring and management interfaces are available (for example, the web UI in the Microservices Architecture or the Enterprise Manager plug-in), simplifying configuration and monitoring.

Cons:

  1. Cost: It is a paid tool with a relatively high price tag.
  2. Learning Curve: It comes with a learning curve, particularly for beginners.

③ Tapdata

Pros:

  1. Simplified Configuration: Tapdata offers an intuitive configuration interface, reducing complexity in both setup and maintenance.
  2. Real-time Monitoring: Features real-time monitoring and alerting, facilitating management and maintenance.
  3. Low Latency: Emphasizes low latency and task optimization based on throughput, ideal for high real-time requirements.
  4. Cloud-native: Supports cloud deployments, seamlessly integrating with cloud ecosystems.

Cons:

  1. Cost Consideration: Both local deployment and Tapdata Cloud incur additional fees after reaching a certain usage threshold.
  2. Resource Usage: Log parsing consumes some resources on the source database.

Considerations:

  • Tailor to specific needs: Select the solution that aligns with your requirements and team expertise. Manual configuration suits teams with deep technical knowledge, OGG is suitable for stable environments if there's a large enough budget, while Tapdata excels in agility and low latency.
  • Cost-Effectiveness: Balance costs, learning curves, and configuration efficiency when making a decision.
  • Ecosystem Compatibility: Evaluate how well the solution integrates with existing systems and tools.

Overall, synchronizing Oracle data to Kafka empowers enterprises with flexible, efficient, and real-time data processing capabilities, fostering modern data architectures to adapt to dynamic business landscapes. By choosing the right synchronization solution, such as Debezium, OGG, or Tapdata, and optimizing configurations appropriately, enterprises can meet real-time data processing demands effectively, enhancing competitiveness and adaptability. Among these options, Tapdata stands out for its low latency, ease of use, scalability, and real-time monitoring, providing a dependable solution for real-time Oracle to Kafka synchronization.