Flink CDC Docker Environment

A comprehensive Apache Flink CDC (Change Data Capture) environment with SQL Server and PostgreSQL integration using Docker Compose.

What is CDC and Flink?

Change Data Capture (CDC) is a technology that monitors and captures data changes in a database as they occur. Instead of periodically checking for changes, CDC provides real-time notifications when data is inserted, updated, or deleted.

Apache Flink is a stream processing framework that processes data in real-time as it flows through the system. It can handle millions of events per second with low latency and high reliability.

Together: CDC captures database changes instantly, and Flink processes these changes in real-time to transform and route data to other systems.
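To make this concrete, here is a minimal Flink SQL sketch of a CDC source table using the SQL Server CDC connector. The hostname, credentials, database, and column names below are illustrative (option names can also vary between connector versions), and the connector JAR must be on Flink's classpath:

```sql
-- Illustrative only: declares SQL Server's customers table as a streaming
-- CDC source in Flink SQL (connection details are placeholders).
CREATE TABLE customers_src (
    id   INT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector'     = 'sqlserver-cdc',
    'hostname'      = 'mssql',
    'port'          = '1433',
    'username'      = 'sa',
    'password'      = '<password>',
    'database-name' = 'cdc_demo',
    'table-name'    = 'dbo.customers'
);

-- Every INSERT/UPDATE/DELETE on the source table now arrives as a changelog row:
SELECT * FROM customers_src;
```

With a source like this, downstream operators see each database change as a row with a change kind (insert, update-before/after, delete) rather than a periodic snapshot.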

Architecture

(architecture diagram)

This project demonstrates real-time data streaming and CDC capabilities using:

  • Apache Flink 1.20.0 - Stream processing engine
  • SQL Server 2022 - Source database with CDC enabled
  • PostgreSQL 15 - Target database for processed data
  • CloudBeaver - Web-based database administration tool

Components

Flink Cluster

  • JobManager: Coordinates job execution and resource management
  • TaskManager1 & TaskManager2: Execute streaming tasks (16 slots each)
  • Job Submitter: Automatically uploads and runs JAR files

Databases

  • SQL Server: Source database with CDC enabled for customers and sellers tables
  • PostgreSQL: Target database for transformed data storage

Flink Jobs

  1. Data Generator Job (mssql_job): Generates synthetic data for testing
  2. CDC Processing Job (cdc_job): Processes CDC events and loads to PostgreSQL

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Java 11+ (for building JAR files)
  • Maven 3.6+ (for building projects)

Setup

  1. Clone and navigate to the project:
git clone <repository-url>
cd cdc_docker
  2. Start the complete environment:
docker compose up

That's it! The Docker Compose configuration includes everything needed:

  • Pre-built Flink jobs are available in flink_jobs_jars/ directory
  • All services (Flink cluster, SQL Server, PostgreSQL, CloudBeaver) start automatically
  • Flink jobs are automatically submitted and executed
  • The complete MSSQL → PostgreSQL data integration pipeline starts immediately
  3. Verify services: open the Flink dashboard at http://localhost:8081 and CloudBeaver at http://localhost:8978.
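The reachability of both web UIs can also be checked from the command line. This sketch assumes the default ports from this compose file; it prints OK/FAIL per endpoint rather than aborting, so it is safe to run before the stack is fully up:

```shell
# Quick reachability check for the web UIs (default ports assumed).
for endpoint in \
    "Flink dashboard|http://localhost:8081/overview" \
    "CloudBeaver|http://localhost:8978"; do
  name=${endpoint%%|*}
  url=${endpoint#*|}
  # -f: treat HTTP errors as failures, -m 5: give up after 5 seconds
  if curl -fsS -m 5 -o /dev/null "$url" 2>/dev/null; then
    echo "OK   $name ($url)"
  else
    echo "FAIL $name ($url)"
  fi
done
```

/overview is a standard Flink REST endpoint, so an OK there means the JobManager is up, not just that the port is bound.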

Optional - Rebuild jobs (only if you modify the source code):

# Build data generator
cd mssql_job
mvn clean package
cp target/mssql-data-generator-1.0-SNAPSHOT.jar ../flink_jobs_jars/

# Build CDC job
cd ../cdc_job
mvn clean package
cp target/cdc_job-1.jar ../flink_jobs_jars/cdc_job-1.jar
cd ..

Configuration

SQL Server CDC Setup

The environment automatically:

  • Enables CDC on the database
  • Starts SQL Server Agent (required for CDC)
  • Creates customers and sellers tables
  • Configures CDC for both tables
  • Sets up CDC capture jobs for real-time change tracking
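Under the hood, enabling CDC in SQL Server comes down to two system stored procedures. A sketch of what the init script effectively runs (the database context is an assumption; see init.sql for the actual statements):

```sql
-- Run inside the source database; requires SQL Server Agent to be running.
EXEC sys.sp_cdc_enable_db;

-- Enable CDC per table (repeat for sellers).
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'customers',
    @role_name     = NULL;  -- NULL: no gating role required to read change data
```

Once enabled, SQL Server maintains change tables under the cdc schema, which is what the Flink connector reads from.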

PostgreSQL Setup

  • Creates target tables with CDC metadata columns
  • Configures connection for Flink jobs
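The target tables carry the source columns plus CDC metadata. A sketch of the shape (column names for the metadata follow the sample queries later in this README; see init_postgres.sql for the actual schema):

```sql
-- Illustrative target table: source columns plus CDC metadata columns.
CREATE TABLE customers (
    id            INT,
    name          TEXT,
    operation     TEXT,       -- INSERT / UPDATE / DELETE
    cdc_timestamp TIMESTAMP   -- when the change was captured
);
```

Keeping the operation type and capture timestamp on every row makes it possible to audit the change history rather than only the latest state.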

Automatic Dependencies

The system automatically downloads required JAR files:

  • PostgreSQL JDBC Driver
  • SQL Server JDBC Driver
  • Flink SQL Gateway API

Usage

Monitoring

  • Flink Dashboard (http://localhost:8081): Monitor job status, metrics, and logs
  • CloudBeaver (http://localhost:8978): Query both source and target databases
  • Docker Logs: View container logs for troubleshooting

Data Flow

  1. Data Generator creates INSERT/UPDATE/DELETE operations in SQL Server
  2. CDC captures changes in real-time
  3. Flink processes and transforms the data
  4. Processed data is loaded into PostgreSQL with operation metadata

Sample Queries

Check CDC data in PostgreSQL:

-- View customer changes
SELECT * FROM customers ORDER BY cdc_timestamp DESC LIMIT 10;

-- View seller changes  
SELECT * FROM sellers ORDER BY cdc_timestamp DESC LIMIT 10;

-- Count operations by type
SELECT operation, COUNT(*) FROM customers GROUP BY operation;

Troubleshooting

Common Issues

  1. Port conflicts: Ensure ports 8081 (Flink UI), 1433 (SQL Server), 5432 (PostgreSQL), and 8978 (CloudBeaver) are available
  2. Memory issues: Increase Docker memory allocation if needed
  3. Job failures: Check Flink logs in the web UI
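A quick way to spot port conflicts before starting the stack (uses ss, available on most Linux systems; reports "unknown" when it is not installed):

```shell
# Report whether each required port is already bound on this host.
for p in 8081 1433 5432 8978; do
  if command -v ss >/dev/null 2>&1; then
    if ss -ltn 2>/dev/null | grep -q ":$p "; then
      echo "port $p in use"
    else
      echo "port $p free"
    fi
  else
    echo "port $p unknown (ss not installed)"
  fi
done
```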

Logs

# View specific service logs
docker compose logs jobmanager
docker compose logs taskmanager1
docker compose logs mssql
docker compose logs postgresql

# Follow logs in real-time
docker compose logs -f flink-job-submitter

Restart Services

# Restart specific service
docker compose restart jobmanager

# Restart all services
docker compose down && docker compose up -d

# Clean restart (removes volumes)
docker compose down -v && docker compose up -d

Development

Adding New Jobs

  1. Create JAR file and place in flink_jobs_jars/
  2. Restart flink-job-submitter service
  3. Monitor job execution in Flink UI

Modifying Database Schema

  1. Update init.sql for SQL Server changes
  2. Update init_postgres.sql for PostgreSQL changes
  3. Rebuild and restart containers

Performance Tuning

Flink Configuration

  • TaskManager slots: 16 per instance (32 total)
  • Parallelism: Configurable per job
  • Checkpointing: Disabled by default
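Since checkpointing is off by default, jobs restart from scratch on failure. If recovery guarantees are needed, it can be turned on per session from the Flink SQL client; the 10s interval below is purely illustrative:

```sql
-- Flink SQL client: enable periodic checkpoints for this session.
SET 'execution.checkpointing.interval' = '10s';
```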

Database Optimization

  • SQL Server: CDC cleanup configured
  • PostgreSQL: Indexes on frequently queried columns
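For example, the sample queries earlier sort by cdc_timestamp, so an index like the following (name illustrative) keeps those lookups fast as the change history grows:

```sql
-- Speeds up ORDER BY cdc_timestamp DESC queries on the target table.
CREATE INDEX idx_customers_cdc_timestamp ON customers (cdc_timestamp);
```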