A Kafka Connect Single Message Transform (SMT) that detects and filters out duplicate messages based on a unique key field in the record. This is useful in streaming pipelines where upstream systems may occasionally produce repeated records.
Supports both schemaless and schema-based records, and allows flexible configuration of how to find and identify the unique key.
- Detects and drops duplicate Kafka messages
- Configurable unique key for identifying duplicates
- Supports both schema-based and schemaless records
- Supports two field search strategies: `path` and `recursive`
- Pluggable cache strategies (currently supports: `in_memory`)
- Automatic cache clearing based on a specified ms interval
Even though some Kafka Connect connectors and Kafka clients support exactly-once semantics, duplicates can and do still happen in real-world pipelines due to:
- 🔁 Connector restarts or rebalancing
- 🐛 Bugs in connectors or custom SMTs (Single Message Transforms)
- 🧱 Reprocessing historical data
- ⛓️ Complex pipelines with multiple Kafka topics, microservices, or enrichment layers
- 💥 Failures during sink operations (e.g. DB write failure, API timeouts)
This tool provides a defense-in-depth layer to:
- 🚫 Detect and ignore duplicates based on configurable fields
- 🧪 Enhance observability by surfacing duplicate messages
- 💼 Help teams working with non-idempotent sinks (e.g., databases, payment APIs)
Use it as a pluggable strategy for duplicate detection when reliability and correctness are key.
| Config Key | Type | Required | Default | Description |
|---|---|---|---|---|
| `unique.key` | string | ✅ | – | The field name used to uniquely identify each record |
| `cache.method` | string | ❌ | `in_memory` | Options: `in_memory`, `redis` |
| `field.search.strategy` | string | ✅ | `path` | Options: `path`, `recursive` |
| `enable.cache.clear` | boolean | ❌ | `false` | Enables periodic cache clearing |
| `clear.cache.ms` | long | ❌ | `1000` | Interval in milliseconds for clearing the cache |
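For example, a connector could register this transform as follows (a minimal sketch in standalone `.properties` form; the class name `com.example.kafka.connect.smt.Deduplicate` and the alias `dedupe` are placeholders, so check this repository for the actual fully qualified transform class):

```properties
# Register the SMT under the alias "dedupe"
# NOTE: the class below is a placeholder; use the actual class from this repo
transforms=dedupe
transforms.dedupe.type=com.example.kafka.connect.smt.Deduplicate

# Required: the field that uniquely identifies each record
transforms.dedupe.unique.key=id

# How to locate the field, and where to cache already-seen keys
transforms.dedupe.field.search.strategy=path
transforms.dedupe.cache.method=in_memory

# Optional: clear the deduplication cache every 60 seconds
transforms.dedupe.enable.cache.clear=true
transforms.dedupe.clear.cache.ms=60000
```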
The `field.search.strategy` config allows two strategies for finding the `unique.key` in a message: `path` and `recursive`.
`path` strategy:
- How it works: Uses dot-notation to traverse nested fields (`a.b.c`).
- Performance: ✅ Fastest. Direct lookup without unnecessary recursion.
- Best for: Consistently structured records with known nesting.
- Fails if: Any part of the path doesn't exist.
🧪 Example: With record `{ "outer": { "inner": { "id": "abc123" } } }` and `unique.key=outer.inner.id`, the value `"abc123"` is found.
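The equivalent configuration might look like this (same placeholder alias as in the example above):

```properties
# Follow the exact path outer -> inner -> id
transforms.dedupe.field.search.strategy=path
transforms.dedupe.unique.key=outer.inner.id
```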
`recursive` strategy:
- How it works: Recursively inspects all nested fields until it finds a match for the field name.
- Performance: ❌ Slower. May visit many nested fields before finding a match.
- Best for: Inconsistent or deeply nested record structures.
- Fails if: The field name doesn't exist anywhere in the record (but less brittle than `path`).
🧪 Example: With record `{ "meta": { "tracking": { "id": "abc123" } } }` and `unique.key=id`, the value `"abc123"` is found by deep search.
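The equivalent configuration might look like this (same placeholder alias as above):

```properties
# Search every nesting level for a field named "id"
transforms.dedupe.field.search.strategy=recursive
transforms.dedupe.unique.key=id
```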
`in_memory`:
- Keeps a simple map of keys in memory.
- ✅ Fast, but not distributed.
- ❌ All state is lost on restart.
`redis`:
- Distributed cache.
- Survives process restarts.
- Slower than in-memory, but suitable for horizontal scaling.
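Selecting a cache method is a single config switch (a sketch using the placeholder alias from the configuration example; `redis` appears in the options table above):

```properties
# Default: a per-worker in-memory map
transforms.dedupe.cache.method=in_memory

# Shared cache across workers (see the options table above)
# transforms.dedupe.cache.method=redis
```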
To prevent unbounded memory growth or to support time-based deduplication, this transformation supports automatic cache clearing at a configurable interval.
| Property | Type | Default | Description |
|---|---|---|---|
| `enable.cache.clear` | boolean | `false` | Whether to enable automatic cache clearing |
| `clear.cache.ms` | long | `60000` | How often (in milliseconds) to clear the cache |
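For instance, to drop all remembered keys once per minute (placeholder alias as above):

```properties
# Clear the cache every 60 seconds, in a background thread
transforms.dedupe.enable.cache.clear=true
transforms.dedupe.clear.cache.ms=60000
```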
The cache is cleared in a background thread, independently of record processing.
Clearing the cache too frequently (e.g., every few milliseconds) may reduce deduplication accuracy and introduce unnecessary CPU overhead.
A thread-safe cache (such as `ConcurrentHashMap`) is used by default to ensure safe access from multiple threads.
For Redis-based caches, frequent clearing could increase I/O and reduce efficiency; consider using Redis TTL configuration instead (coming soon).
You can download the latest compiled .jar file from the Releases section of this repository.
🔽 Latest Release
1. Go to the Releases page.
2. Find the most recent version (e.g., v1.0.0).
3. Download the `.jar` file under Assets.
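After downloading, the jar must be visible to your Connect workers. A common approach is to place it in a directory on the worker's `plugin.path` (a sketch; the directory shown is an assumption, adjust it to your installation):

```properties
# In connect-distributed.properties or connect-standalone.properties:
# point plugin.path at the directory containing the downloaded jar
plugin.path=/opt/kafka/plugins
```

Restart the Connect worker after adding the jar so the plugin is discovered.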