Spark Streaming Kafka

This module provides an alternative Kafka integration for legacy Spark Streaming DStreams, with prefetching consumers, pooled reuse, and admin-based offset management. It is more extensible and configurable than the original integration.

Requires Spark 4 and Java 21+.

Improvements vs official spark-streaming-kafka

Compared to the official Spark Streaming Kafka integration, this rewrite adds:

  • Prefetching consumers that poll on a dedicated virtual thread and feed a bounded queue, smoothing fetch/processing, enabling async I/O during batch processing, and applying natural backpressure.
  • A cached consumer pool with size and TTL controls to avoid reconnects and preserve prefetch state for hot partitions.
  • Driver-side offset discovery and commits via Kafka Admin APIs rather than the Consumer API.
  • Optional offset range splitting to reduce skew across Spark tasks and automatically adjust the number of partitions per batch.
  • A more resilient driver loop, with retry hooks planned to avoid crashes under transient failures.
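The prefetching idea can be sketched in a few lines. The names below are illustrative, not this module's actual classes: a dedicated virtual thread (Java 21+) stands in for the polling consumer and fills a bounded queue; the bounded capacity is what gives the natural backpressure, because `put` blocks the poller whenever batch processing falls behind.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

object PrefetchSketch {
  // Prefetch `records` items through a queue of the given capacity and
  // return the sum of everything drained.
  def prefetchSum(records: Int, capacity: Int): Int = {
    val queue = new ArrayBlockingQueue[Int](capacity)

    // Poller on a dedicated virtual thread, standing in for consumer.poll(...).
    // put() blocks when the queue is full, throttling the poller naturally.
    val poller = Thread.ofVirtual().start { () =>
      (1 to records).foreach(r => queue.put(r))
    }

    // "Batch processing" drains the queue; slow draining slows the poller.
    val drained = (1 to records).map(_ => queue.poll(1, TimeUnit.SECONDS))
    poller.join()
    drained.sum
  }

  def main(args: Array[String]): Unit =
    println(prefetchSum(records = 10, capacity = 3)) // prints 55
}
```

The real integration layers consumer pooling and offset tracking on top of this, but the queue-between-two-threads shape is the core of it.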

Usage

```scala
import org.apache.kafka.clients.admin.AdminClientConfig
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.kafka.consumer.pool.ConsumerPoolStrategies
import org.apache.spark.streaming.kafka.offset.OffsetStoreStrategies

val sparkConf = new SparkConf().setAppName("KafkaDStreamExample")
val ssc = new StreamingContext(sparkConf, Seconds(5))

val adminClientParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "kafka:9092",
  AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG -> "5000"
)

val executorKafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "kafka:9092",
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer].getName,
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer].getName
)

val stream = KafkaDStream.builder[String, String](ssc)
  .withAdminClientParams(adminClientParams)
  .withSubscription(SubscriptionStrategies.Subscribe(Set("topic-a")))
  .withOffsetStore(OffsetStoreStrategies.NativeKafka(groupId = "stream"))
  .withConsumerPool(ConsumerPoolStrategies.CachedConsumer(executorKafkaParams))
  .build()

stream.foreachRDD { (rdd, time: Time) =>
  // Process records...
  stream.commitAsync(time)
}

ssc.start()
ssc.awaitTermination()
```

Configuration

Spark configs specific to this module:

Consumer:

  • spark.streaming.kafka.consumer.poll.ms (default: spark.network.timeout * 1000)

    Consumer poll timeout in milliseconds.

  • spark.streaming.kafka.consumer.prefetch.queueCapacity (default: 3 * max.poll.records)

    Prefetch queue capacity (records); larger values smooth bursts but increase memory use per executor.

  • spark.streaming.kafka.consumer.cache.maxSize (default: 32)

    Max cached consumers per executor.

  • spark.streaming.kafka.consumer.cache.expireAfterAccessSec (default: 300)

    Cache TTL in seconds.

Rate limits:

  • spark.streaming.kafka.maxRatePerPartition (default: +inf)

    Upper bound on records read per partition per batch; set this to cap input during traffic spikes.

  • spark.streaming.kafka.minRatePerPartition (default: 0)

    Lower bound on records read per partition per batch.

Offsets:

  • spark.streaming.kafka.offsetRange.maxSizeRatio (default: +inf)

    Split skewed offset ranges when a partition exceeds this ratio of the average range size (must be > 1).

  • spark.streaming.kafka.offsets.maxRetries (default: 3)

    Retry count for offset discovery; increase if your cluster sees transient Admin API failures.
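
To make the `offsetRange.maxSizeRatio` behavior concrete, here is an illustrative splitter (a sketch, not this module's actual code): any range larger than `maxSizeRatio` times the average range size is cut into roughly equal sub-ranges, so one hot partition no longer dominates a batch.

```scala
object SplitSketch {
  // Split any (start, end) offset range whose size exceeds
  // maxSizeRatio * average size into roughly equal chunks.
  def split(ranges: Seq[(Long, Long)], maxSizeRatio: Double): Seq[(Long, Long)] = {
    val avg = ranges.map { case (s, e) => e - s }.sum.toDouble / ranges.size
    ranges.flatMap { case (start, end) =>
      val size = end - start
      if (size > maxSizeRatio * avg) {
        val parts = math.ceil(size / (maxSizeRatio * avg)).toInt
        val step = math.ceil(size.toDouble / parts).toLong
        (start until end by step).map(s => (s, math.min(s + step, end)))
      } else Seq((start, end))
    }
  }

  def main(args: Array[String]): Unit = {
    // One hot partition (1000 records) next to two small ones (100 each);
    // with maxSizeRatio = 2.0 the hot range is split in two.
    val out = split(Seq((0L, 1000L), (0L, 100L), (0L, 100L)), maxSizeRatio = 2.0)
    println(out.mkString(" ")) // (0,500) (500,1000) (0,100) (0,100)
  }
}
```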

Standard Spark backpressure settings apply:

  • spark.streaming.backpressure.enabled
  • spark.streaming.backpressure.initialRate
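
Putting a few of these together in a submit command (the values, class name, and jar below are illustrative, not recommendations):

```
spark-submit \
  --conf spark.streaming.kafka.consumer.cache.maxSize=64 \
  --conf spark.streaming.kafka.consumer.cache.expireAfterAccessSec=600 \
  --conf spark.streaming.kafka.maxRatePerPartition=10000 \
  --conf spark.streaming.kafka.offsetRange.maxSizeRatio=2 \
  --conf spark.streaming.backpressure.enabled=true \
  --class com.example.KafkaDStreamExample \
  app.jar
```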

Ideas / TODO

  • Allow consumers to fetch partitions from replicas in the same rack by leveraging Kafka rack awareness and follower fetching (KIP-392).
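
For reference, rack-aware follower fetching in Kafka itself is driven by two settings; wiring them through this module's consumer params is what the item above describes (the rack id below is an example):

```
# Broker: allow consumers to fetch from the nearest replica
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Consumer: identify the client's rack so the broker can match a replica to it
client.rack=rack-1
```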
