
I have a requirement to process huge files, and multiple files may end up being processed in parallel.

  • Each row in a given file is processed by a rule specific to that file.
  • Once processing is complete, an output file is generated from the processed records.

One option I have thought of: each message pushed to the broker carries the row data, the rule to be applied, and a correlation ID (an identifier for that particular file).

I plan to use Kafka Streams and create a topology with a processor that receives each message along with its rule, processes it, and sinks it.
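The per-row message and per-message processing described above can be sketched as follows. This is a minimal simulation in plain Python; the actual implementation would be a Kafka Streams processor in Java, and all field and function names here are illustrative, not from the question:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowMessage:
    """One message on the broker: row data + rule + file identifier."""
    correlation_id: str  # identifies the source file
    rule_id: str         # rule to apply, specific to that file
    row_data: str        # raw row payload


def process(msg: RowMessage) -> str:
    # Placeholder for "get the rule with the message and process it":
    # look up the rule by rule_id and apply it to row_data.
    # Here the "rule" is simulated as upper-casing the row.
    return f"{msg.correlation_id}:{msg.rule_id}:{msg.row_data.upper()}"
```

In a real topology, `process` would run inside a `Processor` and forward its result to a sink topic.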

However (I am new to Kafka Streams, so I may be wrong):

  • The messages will not be processed in sequence, since we are processing multiple files in tandem (which is fine, because there is no ordering requirement, and I want to keep things decoupled). But then how do I bring it to logical closure, i.e. how does my processor know that all the records of a file have been processed?
  • Do I need to maintain the records (correlation ID, number of records, etc.) in something like Ignite? I am unsure about that.
  • One option I could think of is a stateful store: the key would be something like correlation-ID + total number of records in the file, and the value would be the count of records processed so far. Once the two match, we can treat that as logical closure for the file and sink accordingly. Commented Mar 16, 2018 at 6:39
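The counting approach in the comment above can be simulated outside Kafka. In a real topology this map would live in a Kafka Streams state store (e.g. a `KeyValueStore`) so it survives restarts; the class and method names below are illustrative:

```python
class ClosureTracker:
    """Tracks processed-record counts per file (correlation ID)."""

    def __init__(self):
        self._processed = {}  # correlation_id -> records processed so far

    def record_processed(self, correlation_id: str, total_records: int) -> bool:
        """Increment the count for this file.

        Returns True when the processed count reaches the file's total,
        i.e. logical closure for that file.
        """
        seen = self._processed.get(correlation_id, 0) + 1
        self._processed[correlation_id] = seen
        return seen == total_records
```

The caller would trigger the output-file generation for a correlation ID as soon as `record_processed` returns True.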

1 Answer


I guess you can set aside a key/value record that is sent to the topic at the end of the file, signifying the closure of that file. Say the record has a unique key, such as -1, that signifies EOF.
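The sentinel idea can be sketched as a small routing function: a reserved key (here "-1", as the answer suggests) marks end-of-file, and everything else is a normal row. The function and parameter names are illustrative:

```python
EOF_KEY = "-1"  # reserved key signalling end-of-file, per the answer


def handle(key: str, value: str, sink, close_file):
    """Route a record: normal rows go to the sink, the sentinel closes the file."""
    if key == EOF_KEY:
        close_file(value)  # value identifies which file ended
    else:
        sink(key, value)   # forward the processed row downstream
```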


2 Comments

Yes, that could work, but since the whole paradigm is asynchronous, could the EOF message arrive while some of the earlier messages are still being processed? If so, should we also maintain the count of processed records in the state?
There would be ordering within a single partition of a Kafka topic, but not between partitions. We can rephrase and say that each file name would be a key, and a unique value can be set to indicate the closure of the file.
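Combining the two comments: if the EOF marker also carries the expected record count, closure can be declared only when both the marker has arrived and the processed count matches, so a marker that is consumed before the last row is harmless. A sketch of that combined logic, with illustrative names:

```python
class FileCloser:
    """Declares a file closed only when the EOF marker has arrived
    AND the number of processed rows matches the expected total."""

    def __init__(self):
        self._seen = {}      # file_name -> rows processed so far
        self._expected = {}  # file_name -> total rows, known once EOF arrives

    def on_row(self, file_name: str) -> bool:
        self._seen[file_name] = self._seen.get(file_name, 0) + 1
        return self._closed(file_name)

    def on_eof(self, file_name: str, total: int) -> bool:
        self._expected[file_name] = total
        return self._closed(file_name)

    def _closed(self, file_name: str) -> bool:
        # None != count until the EOF marker supplies the expected total.
        return self._expected.get(file_name) == self._seen.get(file_name, 0)
```

Note that if the rows and the EOF marker share the same key (the file name), they land in the same partition and the marker is consumed last anyway, per the ordering guarantee mentioned above.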
