
I have a requirement to process huge files, and multiple files may end up being processed in parallel.

  • Each row in a given file is processed by a rule specific to that file.
  • Once processing is complete, an output file is generated from the processed records.

One option I have thought of: each message pushed to the broker carries the row data, the rule to be applied, and a correlation ID (an identifier for that particular file).

I plan to use Kafka Streams and create a topology with a processor that receives each message along with its rule, processes it, and sinks it.
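The per-row message and per-message processing described above can be sketched as follows. This is a minimal simulation in plain Python; the actual implementation would be a Kafka Streams processor in Java, and all field and function names here are illustrative, not from the question:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowMessage:
    """One message on the broker: row data + rule + file identifier."""
    correlation_id: str  # identifies the source file
    rule_id: str         # rule to apply, specific to that file
    row_data: str        # raw row payload


def process(msg: RowMessage) -> str:
    # Placeholder for "get the rule with the message and process it":
    # look up the rule by rule_id and apply it to row_data.
    # Here the "rule" is simulated as upper-casing the row.
    return f"{msg.correlation_id}:{msg.rule_id}:{msg.row_data.upper()}"
```

In a real topology, `process` would run inside a `Processor` and forward its result to a sink topic.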

However (I am new to Kafka Streams, so I may be wrong):

  • The messages will not be processed in sequence, since we are processing multiple files in tandem (which is fine, because there is no ordering requirement, and I want to keep things decoupled). But then how do I bring it to logical closure, i.e. how does my processor know that all the records of a file have been processed?
  • Do I need to maintain the records (correlation ID, number of records, etc.) in something like Ignite? I am unsure about that.
  • One option I could think of is a stateful store: the key would be something like correlation-ID + total number of records in the file, and the value would be the count of records processed so far. Once the two match, we can treat that as logical closure for the file and sink accordingly. Commented Mar 16, 2018 at 6:39
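The counting approach in the comment above can be simulated outside Kafka. In a real topology this map would live in a Kafka Streams state store (e.g. a `KeyValueStore`) so it survives restarts; the class and method names below are illustrative:

```python
class ClosureTracker:
    """Tracks processed-record counts per file (correlation ID)."""

    def __init__(self):
        self._processed = {}  # correlation_id -> records processed so far

    def record_processed(self, correlation_id: str, total_records: int) -> bool:
        """Increment the count for this file.

        Returns True when the processed count reaches the file's total,
        i.e. logical closure for that file.
        """
        seen = self._processed.get(correlation_id, 0) + 1
        self._processed[correlation_id] = seen
        return seen == total_records
```

The caller would trigger the output-file generation for a correlation ID as soon as `record_processed` returns True.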

1 Answer


I guess you can set aside a key/value record that is sent to the topic at the end of the file, signifying the closure of that file. Say the record has a unique key, such as -1, that signifies EOF.
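The sentinel idea can be sketched as a small routing function: a reserved key (here "-1", as the answer suggests) marks end-of-file, and everything else is a normal row. The function and parameter names are illustrative:

```python
EOF_KEY = "-1"  # reserved key signalling end-of-file, per the answer


def handle(key: str, value: str, sink, close_file):
    """Route a record: normal rows go to the sink, the sentinel closes the file."""
    if key == EOF_KEY:
        close_file(value)  # value identifies which file ended
    else:
        sink(key, value)   # forward the processed row downstream
```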


2 Comments

Yes, that could work, but since the whole paradigm is asynchronous, could the EOF message arrive while some of the earlier messages are still being processed? If so, should we also maintain the count of processed records in the state?
There would be ordering within a single partition of a Kafka topic, but not between partitions. We can rephrase and say that each file name would be a key, and a unique value can be set to indicate the closure of the file.
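Combining the two comments: if the EOF marker also carries the expected record count, closure can be declared only when both the marker has arrived and the processed count matches, so a marker that is consumed before the last row is harmless. A sketch of that combined logic, with illustrative names:

```python
class FileCloser:
    """Declares a file closed only when the EOF marker has arrived
    AND the number of processed rows matches the expected total."""

    def __init__(self):
        self._seen = {}      # file_name -> rows processed so far
        self._expected = {}  # file_name -> total rows, known once EOF arrives

    def on_row(self, file_name: str) -> bool:
        self._seen[file_name] = self._seen.get(file_name, 0) + 1
        return self._closed(file_name)

    def on_eof(self, file_name: str, total: int) -> bool:
        self._expected[file_name] = total
        return self._closed(file_name)

    def _closed(self, file_name: str) -> bool:
        # None != count until the EOF marker supplies the expected total.
        return self._expected.get(file_name) == self._seen.get(file_name, 0)
```

Note that if the rows and the EOF marker share the same key (the file name), they land in the same partition and the marker is consumed last anyway, per the ordering guarantee mentioned above.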
