Although CDC is not new, modern demands for real-time ingestion and serving the freshest possible data to applications give you an opportunity to re-evaluate the pros and cons of the different techniques.
When looking at the many developments in the data space over the past few years, one key trend is the move to data streaming and processing data in real time. Whereas in the past, batch-driven architectures propagated data changes between systems on an hourly or even daily basis, this is no longer good enough: cloud data warehouses, caches, or search indexes must be updated with a latency of seconds at most in order to provide fresh and up-to-date views of the data.
One enabling technology for satisfying these requirements is change data capture (CDC): the process of extracting any inserted, updated, or deleted data from a database and streaming events describing these changes to consumers with low latency. Interestingly, CDC isn’t a particularly new idea. It has been around for quite some time in relational databases like Oracle or Db2, but the concept has recently seen a massive surge in adoption, in particular in the context of data streaming platforms such as Apache Kafka and Apache Pulsar. Mature open source CDC offerings like Debezium make it very easy to set up change event streams, enabling a large number of data use cases.
In this article, I’d like to discuss several different approaches for implementing CDC, as well as what some key applications are and how CDC fits into the larger picture of modern data streaming architectures.
There are several ways of extracting change events from a database, each with its own pros and cons, so let’s take a closer look at each.
First, there’s query-based CDC. In this approach, a polling loop runs at a regular interval and identifies any records that have changed since its last execution. This is conceptually rather simple, but there are some caveats. Most importantly, there’s an inherent conflict between running that loop as often as possible, so as to ensure a high degree of data freshness, and not running it so often that the polling queries overload the database.
And no matter how often you poll for changed data, it cannot be guaranteed that intermediary data changes between two loop runs won’t be missed. In the most extreme case, a record that gets created and then deleted quickly enough may never be captured at all. Another disadvantage is that polling loops cannot capture delete events, as the deleted records will simply be gone. In addition, this approach requires cooperation from developers when designing the data model, because each table needs to record when a row was last changed, e.g., in the form of an “UPDATED_AT” column. On the plus side, polling-based CDC is rather simple to implement and doesn’t require any specific capabilities in the database itself, which is why it works with a large variety of databases.
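To make this concrete, here is a minimal sketch of query-based CDC in Python, using an in-memory SQLite database as a stand-in for an operational store (the table and column names are illustrative). It also demonstrates the blind spot for deletes discussed above.

```python
import sqlite3
import time

# In-memory stand-in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at INTEGER)")

def upsert_customer(cid, email):
    # Every write must also touch the updated_at column -- this is the
    # cooperation from the data model that the polling approach depends on.
    conn.execute(
        "INSERT INTO customers (id, email, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email, updated_at = excluded.updated_at",
        (cid, email, time.monotonic_ns()),
    )
    conn.commit()

last_seen = 0

def poll_changes():
    # One iteration of the polling loop: fetch every row touched since the
    # previous run. Note that DELETEs are invisible -- the rows are simply gone.
    global last_seen
    rows = conn.execute(
        "SELECT id, email, updated_at FROM customers WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    if rows:
        last_seen = rows[-1][2]
    return [(r[0], r[1]) for r in rows]

upsert_customer(1, "anne@example.com")
first_batch = poll_changes()             # captures the insert
upsert_customer(1, "anne.k@example.com")
second_batch = poll_changes()            # captures the update
conn.execute("DELETE FROM customers WHERE id = 1")
conn.commit()
third_batch = poll_changes()             # the delete goes unnoticed
```

Note how the delete never surfaces in any poll result: from the loop’s perspective, the row simply ceases to exist.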
Another implementation approach is trigger-based CDC. For each table to be captured, triggers are installed for INSERT, UPDATE, and DELETE events. These triggers typically copy the affected records into some kind of staging table, from where they are extracted via polling, as above. This approach has the advantage that no special columns are needed in the data model, nor will any events ever be missed. DELETE events can also be captured, as the triggers are executed as part of the writing transactions themselves. That, however, is also the biggest downside: the triggers can add a non-negligible overhead to write performance, and DBAs (database administrators) often tend to be skeptical when it comes to installing large quantities of triggers in their databases.
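As a sketch of the trigger-based approach, the following Python snippet uses SQLite triggers to copy every change into a staging table; a real deployment would use your database’s own trigger dialect, and all names here are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);

-- Staging table the CDC extractor polls; seq preserves change order.
CREATE TABLE customers_staging (
  seq   INTEGER PRIMARY KEY AUTOINCREMENT,
  op    TEXT,      -- 'c' = create, 'u' = update, 'd' = delete
  id    INTEGER,
  email TEXT
);

-- One trigger per operation, copying the row state into the staging table.
CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
  INSERT INTO customers_staging (op, id, email) VALUES ('c', NEW.id, NEW.email);
END;
CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
  INSERT INTO customers_staging (op, id, email) VALUES ('u', NEW.id, NEW.email);
END;
CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
  INSERT INTO customers_staging (op, id, email) VALUES ('d', OLD.id, OLD.email);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("UPDATE customers SET email = 'b@example.com' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")
conn.commit()

# Unlike the polling approach, the delete is captured too.
events = conn.execute("SELECT op, id, email FROM customers_staging ORDER BY seq").fetchall()
```

Here `events` contains all three operations in order, including the delete that query-based CDC would have missed.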
Lastly, there’s log-based CDC, as implemented by Debezium and similar tools. In this approach, change events are extracted asynchronously from the transaction log of the database, such as the binlog in MySQL, the write-ahead log (WAL) in Postgres, or the redo log in Oracle. The transaction log is the “source of truth” of a database: each transaction appends events to it, allowing for recovery in case of failures, as well as for replication. In that light, a log-based CDC tool behaves like another replication client, receiving all changes applied to the primary database. Extracting changes from the transaction log guarantees that all events are retrieved (including DELETEs), and there are no limitations or requirements with regard to the data model of applications. Push-based notification interfaces like Postgres’ logical decoding mechanism allow for CDC without any relevant overhead on the database, with latencies in the range of milliseconds.
Log-based CDC can be somewhat complex to deploy; for instance, the database may have to be reconfigured to enable it. Also, there’s no standardized interface for retrieving change notifications from the log of a database: APIs and event formats differ between vendors and, in some cases, even between database versions. This also means that if there isn’t a Debezium or other log-based CDC connector for a specific database, you’ll need to fall back on one of the alternative approaches. That being said, log-based CDC is generally the most powerful approach for retrieving change events from a database, and it should be the preferred option where available.
Now, what does a data change event look like? In the case of Debezium, the payload of an update event for a “customers” table looks like this (abbreviated, with illustrative values):

```json
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "anne@example.com"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "anne.kretchmar@example.com"
  },
  "source": {
    "connector": "postgresql",
    "db": "ecommerce",
    "table": "customers",
    "txId": 556
  },
  "op": "u",
  "ts_ms": 1678901234000
}
```
As you can see, there are three parts to each change event:
- Before: the old state of the affected database row in case of an update or delete event; the structure of that before block resembles the structure of the table from which this event originates, in this case, the “customers” table of some e-commerce application
- After: the new state of the row in case of an update or insert event; again, its structure is that of the source table; in the example above, the value of the “email” column has changed
- Metadata: metadata like the type of the operation (“op”), the timestamp of the change (“ts_ms”), and additional information about the source database and table, transaction id, connector name, version, etc.
Over the last few years, the Debezium change event format has established itself as a de-facto standard. Debezium-compatible connectors are not only provided by the project itself but also by other database vendors like ScyllaDB and Yugabyte, which have taken the Debezium connector framework and event format as the foundation for their own CDC connectors. Another example is Google, which just recently announced a Debezium-based CDC connector for their Cloud Spanner database.
When propagating change events to consumers, ensuring correct ordering semantics is very important. While no global ordering is typically required (e.g., across all purchase orders or all customer records), correct ordering of the events pertaining to the same source row is vital: if, for instance, a consumer received two update events for the same record in reverse order, it would end up with an incorrect representation of that record. Therefore, when using popular data streaming platforms like Apache Kafka as the transport layer for change events, the record’s primary key is typically used as the partitioning key for the change event. That way, all events pertaining to the same source record are written to the same partition of the Kafka topic, which guarantees consumers receive them in the exact order they were produced.
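The ordering guarantee can be illustrated with a toy partitioner. Real Kafka clients hash keys with murmur2, but any deterministic hash yields the property that matters here: all events with the same key land in the same partition, and therefore keep their relative order.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Simplified stand-in for the default partitioner in Kafka clients
    # (which uses murmur2); the essential property is that the mapping
    # from key to partition is deterministic.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Change events keyed by the source row's primary key: every event for
# customer 1004 maps to the same partition, preserving per-row order,
# while events for other customers may go to different partitions.
events = [(b"1004", "c"), (b"1004", "u"), (b"1042", "c"), (b"1004", "d")]
partitions = [partition_for(key, 6) for key, _ in events]
```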
Having discussed different means of implementing CDC and what change events commonly look like, let’s now dive into some common use cases for this technology. A first big category of use cases is replication: propagating change events to other data stores addresses a wide range of query requirements which typically cannot, or should not, be handled by operational databases. This ranges from copying data into a separate database for offline analysis, to feeding data into full-text search systems like Elasticsearch, to updating cloud data warehouses like Snowflake and real-time analytics datastores such as Apache Pinot. A related use case is leveraging change events to drive cache updates, for instance to keep a read view of the data in close proximity to the user, allowing for very short response times.
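As a sketch of what a replication or cache-updating consumer does with such events, here is a minimal Python function that applies Debezium-style change events to an in-memory read view; the envelope fields (`op`, `before`, `after`) follow the Debezium format, while the primary-key field name `id` is an assumption for this sketch.

```python
def apply_change_event(view: dict, event: dict) -> None:
    # Apply one Debezium-style change event to a read view keyed by
    # primary key: creates and updates upsert the 'after' state,
    # deletes remove the row identified by the 'before' state.
    op = event["op"]
    if op in ("c", "u", "r"):        # create, update, snapshot read
        row = event["after"]
        view[row["id"]] = row
    elif op == "d":                  # delete
        view.pop(event["before"]["id"], None)

view = {}
apply_change_event(view, {"op": "c", "before": None,
                          "after": {"id": 1004, "email": "anne@example.com"}})
apply_change_event(view, {"op": "u",
                          "before": {"id": 1004, "email": "anne@example.com"},
                          "after": {"id": 1004, "email": "anne.kretchmar@example.com"}})
```

Replaying a change event stream through such a function from the beginning reproduces the current state of the source table, which is exactly what replication and cache-update pipelines rely on.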
Going beyond plain data replication, CDC can help address a number of use cases in the context of microservice architectures. It can be used for implementing the outbox pattern, which facilitates reliable data exchanges between microservices while avoiding unsafe dual writes to a service’s own database and a streaming platform like Kafka; in the absence of distributed (XA) transactions, such dual writes are prone to inconsistencies in failure scenarios.
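Here is a minimal sketch of the outbox pattern, again using SQLite for illustration (table and event names are made up): the business write and the outbox event are committed in one local transaction, and a log-based CDC connector such as Debezium would then pick up the outbox insert from the transaction log and publish it to Kafka, so no dual write is involved.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
CREATE TABLE outbox (
  id             TEXT PRIMARY KEY,   -- unique event id
  aggregate_type TEXT,               -- e.g. 'order'; used for topic routing
  aggregate_id   TEXT,               -- key of the affected aggregate
  event_type     TEXT,
  payload        TEXT                -- serialized event body
);
""")

def place_order(order_id, customer_id, total):
    # Business data and outbox event are written in ONE local transaction;
    # CDC later publishes the outbox insert, guaranteeing consistency
    # between the database state and the emitted event.
    with conn:
        conn.execute("INSERT INTO purchase_orders VALUES (?, ?, ?)",
                     (order_id, customer_id, total))
        conn.execute("INSERT INTO outbox VALUES (?, 'order', ?, 'OrderCreated', ?)",
                     (str(uuid.uuid4()), str(order_id),
                      json.dumps({"id": order_id, "total": total})))

place_order(42, 1004, 99.90)
```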
The strangler fig pattern comes in handy when migrating from a monolithic system design to a microservice architecture: components of the monolith are gradually extracted into equivalent microservices, while a routing component in front of the entire system sends incoming requests either to the monolith (for requests served by components not yet extracted) or to the right microservice. Change data capture can be used in this context for capturing change events from the database of the monolith and streaming them over to the extracted microservice(s). That way, a microservice can, for instance, already implement read views of the data (e.g., displaying the list of a customer’s pending purchase orders), while writes to that data are still handled by the monolith (e.g., placing a new purchase order).
But it doesn’t stop there: CDC can also be used to create audit logs (a persisted change event stream essentially is exactly that), to drive incremental updates to materialized views on your data, or for more specific applications, such as propagating changes to the desired configuration state from the control plane to the data plane within SaaS architectures.
While change data capture is a powerful enabler for many exciting real-time data use cases, by itself it is not enough. After all, publishing change events to a Kafka topic is just a means to an end; the events still need to be propagated to their final destination. In addition, taking change events as-is is often not enough: you may need to filter them (for instance, to exclude data of specific tenants), project them (for instance, to exclude large BLOB columns or pass on only a subset of the fields of a table row), modify them (for instance, to normalize date formats), join them, group and aggregate them, and much more.
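For illustration, here is a hypothetical single-event transformation in Python that combines two of these operations: filtering out one tenant’s events and projecting away a large BLOB column. The `tenant_id` and `avatar` field names are made up for this sketch, not part of any fixed CDC schema.

```python
import copy

def transform(event):
    # Filter: drop all events belonging to an excluded tenant.
    after = event.get("after") or {}
    if after.get("tenant_id") == "tenant-internal":
        return None

    # Projection: strip the large BLOB column from both row images,
    # working on a copy so the original event is left untouched.
    event = copy.deepcopy(event)
    for part in ("before", "after"):
        if event.get(part):
            event[part] = {k: v for k, v in event[part].items() if k != "avatar"}
    return event

kept = transform({"op": "c", "before": None,
                  "after": {"id": 7, "tenant_id": "tenant-a",
                            "avatar": b"...big blob...", "email": "x@example.com"}})
dropped = transform({"op": "c", "before": None,
                     "after": {"id": 8, "tenant_id": "tenant-internal",
                               "email": "y@example.com"}})
```

In practice, such logic would run as one step of a stream processing pipeline rather than as a standalone function, which is exactly where the platforms discussed next come in.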
This is where stream processing platforms come in: they can ingest change events either from streaming platforms (e.g., Kafka or AWS Kinesis) or by running CDC connectors natively within the stream processing platform itself. The processed events can then be written to another Kafka topic (or similar), a database, a data warehouse, etc. That way, stream processing platforms allow you to implement entire end-to-end data integration pipelines, from the CDC source through as many processing steps as needed to the sink.
Apache Flink is a particularly interesting example in this context, as it comes with powerful change stream processing capabilities built in. Also, there’s the Flink CDC framework which integrates Debezium into the Flink ecosystem.
With the Flink DataStream and Table APIs on the one hand, and Flink SQL on the other, two options for implementing change stream pipelines exist: the former are powerful imperative APIs which can be used from Java and Python; the latter is a fully declarative approach, appealing not only to software developers but also to SQL-savvy data engineers. When it comes to running and operating Flink-based data pipelines, users have a range of options, from running everything themselves on their own infrastructure to fully managed SaaS offerings. With the latter, you focus only on your actual stream processing logic, whether in Java/Python or SQL, and hand the processing job to the SaaS platform for execution. This can substantially reduce the cost and time-to-market of new pipelines (CDC-based or otherwise), as it frees you from figuring out all the details of running Flink safely, reliably, and efficiently.
Log-based CDC with open source tools like Debezium can be a powerful companion to modern, distributed data stores like Apache Pinot and stream processing frameworks like Apache Flink, whether you’re consuming the open source implementation or a managed service like the one I’m helping build at Decodable. Either way, we’re almost certainly entering an era where CDC becomes a powerful, necessary element of the modern, real-time data stack.