Kafka Streams Basic Operations

--

So you know Kafka, that’s great! Now let’s see Kafka Streams!

Consider this use case:

  1. We have a purchase system whose data contains the customer name, the customer credit card number, and the customer purchase details.
  2. This data is sent to the Kafka topic t.purchase.
  3. A Kafka listener in the logistics system consumes this data.
  4. In the logistics system, the customer name must remain as-is (e.g. Peter Parker), the credit card data must be removed, and some of the purchase details must be masked (e.g. an item price must become 3*** instead of 3500).

Using Kafka as a data pipeline, we can write a consumer-producer for the logistics system. It consumes data from t.purchase, implements the logic for point #4 above, and sends the transformed data to t.purchase-logistic. Assuming we use Java Spring for Kafka, we will have code like this:

@KafkaListener(topics = "t.purchase")
public void transformLogistic(PurchaseMessage original) {
    LogisticMessage transformed = transformForLogistic(original);
    kafkaTemplate.send("t.purchase-logistic", transformed);
}

private LogisticMessage transformForLogistic(PurchaseMessage original) {
    // remove credit card data
    // mask item price
    // return the transformed message
}
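
For illustration, here is one possible body for transformForLogistic. This is a minimal sketch: the PurchaseMessage and LogisticMessage accessors and the maskPrice helper are assumptions made up for this example, not part of the actual system.

// A sketch assuming hypothetical message shapes.
private LogisticMessage transformForLogistic(PurchaseMessage original) {
    LogisticMessage transformed = new LogisticMessage();
    transformed.setCustomerName(original.getCustomerName()); // name stays as-is
    // credit card data is simply never copied into the logistic message
    transformed.setItemPrice(maskPrice(original.getItemPrice())); // 3500 -> 3***
    return transformed;
}

// Keep the first digit, mask the rest: 3500 -> "3***" (String.repeat needs Java 11+)
private String maskPrice(int price) {
    String digits = String.valueOf(price);
    return digits.charAt(0) + "*".repeat(digits.length() - 1);
}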

This is a simple use case, but what if the use case evolves? For example, because we are an eco-friendly company, we need to send all data that contains hazardous material to the topic t.purchase-logistic-hazard, and data with non-hazardous material to t.purchase-logistic-no-hazard. We need to enhance the listener (and producer) above to route the data. Something like this:

@KafkaListener(topics = "t.purchase")
public void transformLogistic(PurchaseMessage original) {
    LogisticMessage transformed = transformForLogistic(original);
    var topic = transformed.isHazard()
            ? "t.purchase-logistic-hazard"
            : "t.purchase-logistic-no-hazard";
    kafkaTemplate.send(topic, transformed);
}

Well, the good news is, we already have a library for this. The library contains common operations for data transformation that can be chained, one operation after another.

That library is Kafka Streams.

Kafka Streams

Kafka Streams is a small, lightweight Java library. The idea: Kafka Streams takes a topic as input, and for every piece of data, it consumes it, transforms it, then publishes the result to another topic.

Consider Kafka Streams like a consumer that has many functions for data processing: transformation, enrichment, aggregation, etc. At the same time, Kafka Streams is also a producer that publishes to another Kafka topic, so downstream consumers can consume the transformed data directly from the output topic (or sink topic).
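
Something like this, as a sketch in the Kafka Streams DSL. Serde configuration is omitted for brevity, transformForLogistic and isHazard are the assumed methods from the earlier examples, and the broker address is a placeholder.

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "logistic-transformer");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

StreamsBuilder builder = new StreamsBuilder();
KStream<String, PurchaseMessage> purchases = builder.stream("t.purchase");

purchases
        .mapValues(original -> transformForLogistic(original)) // transform every record
        .to((key, value, recordContext) -> value.isHazard()    // route per record
                ? "t.purchase-logistic-hazard"
                : "t.purchase-logistic-no-hazard");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();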

Stream Processing

Wait, what is a stream?

A stream is a sequence of data that comes in order. We usually represent a stream with this kind of diagram, called a marble diagram.

The arrow represents the timeline, where left is earliest and right is latest. Each circle represents one piece of data. The data continuously comes over time; we don’t know when it will end. Each piece of data is immutable: it cannot be changed.

Seems familiar? A Kafka topic is one good example of a stream.

You might be familiar with batching. In batching, we process the data at certain time intervals (e.g. every 15 minutes, at 20:00, 20:15, 20:30, …). Stream processing works with the data as soon as it arrives on the data stream. So if data X arrives at 20:00:04, it is processed right away. On the contrary, if we use batching, data X needs to wait for the next batch at 20:15:00.

Also, if no transactions happen during 21:00–22:00, stream processing simply will not run.

Back to Kafka Streams

Kafka Streams has many common operations that can be used to transform data according to our needs, and that transformation happens as soon as the data arrives. It is a lightweight library, but quite a powerful one.

Furthermore, when dealing with streaming data, there is always a chance that data is sent more than once, e.g. due to a network failure or an application crash. This is known as the “at-least-once” semantic (both Kafka and RabbitMQ have this semantic).

The good news: Kafka Streams has the capability to provide the “exactly-once” semantic, to avoid data being processed multiple times. You can even get “exactly-once” without coding, through configuration alone.
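
For example, in Kafka Streams this is a single configuration property on the props object from the sketch above (note: EXACTLY_ONCE_V2 requires Kafka brokers 2.5 or newer).

// Enable exactly-once processing through configuration alone.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);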

To give a brief overview of the common operations Kafka Streams has, see this short video.

Notes for the video:

  • Each node in the video represents one Kafka record, with the key on the left side and the value on the right side.
  • The upper timeline is the original data stream, and the lower timeline is the transformed data stream.
  • There are two types of streams: KStream and KTable. The differences between them are explained in detail here.
  • An intermediate operation produces another KStream or KTable that can be processed further; we can call it a “chained operation”.
  • A terminal operation returns void and cannot be processed further. Consider a terminal operation the “final” operation (see the sketch after this list).
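
As an illustration, here is a chain of intermediate operations ending in a terminal operation. This is a sketch; the topic names t.words and t.words-upper are made up for the example.

builder.<String, String>stream("t.words")        // source stream
        .filter((key, value) -> value != null)   // intermediate: keep non-null values
        .mapValues(value -> value.toUpperCase()) // intermediate: transform the value
        .to("t.words-upper");                    // terminal: returns void, ends the chain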

These are just the very basic operations of Kafka Streams. If you want to know more about how Kafka and Kafka Streams work using Java Spring (which simplifies your development effort), refer here.

Stay tuned.
