I have a requirement to process huge files, and we may end up processing multiple files in parallel.
- Each row in a given file would be processed against a rule specific to that file.
- Once processing is complete, we would generate an output file based on the processed records.
One option I have thought of: each message pushed to the broker would carry the row data + the rule to be applied + a correlation ID (which acts as an identifier for that particular file).
I plan to use Kafka Streams and create a topology with a processor that takes the rule from each message, applies it to the row, and sinks the result.
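Roughly what I have in mind is the following minimal sketch (the topic names `file-rows` and `processed-rows`, the plain-string value format, and the `applyRule` stand-in are placeholders I made up, not a real rule engine):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class FileRowProcessingApp {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Key = correlation ID of the file, value = one row plus the rule to apply
        // (kept as a simple string here for the sake of the sketch).
        KStream<String, String> rows = builder.stream(
                "file-rows", Consumed.with(Serdes.String(), Serdes.String()));

        rows.mapValues(FileRowProcessingApp::applyRule)   // apply the file-specific rule
            .to("processed-rows", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "file-row-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for the real rule lookup/application: parse the rule out of
    // the message and transform the row accordingly.
    static String applyRule(String rowWithRule) {
        return rowWithRule.toUpperCase(); // stand-in transformation
    }
}
```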
However (I am new to Kafka Streams, so I may be wrong):
- The order in which the messages are processed will not be sequential, since we process multiple files in tandem. That is fine, because there is no ordering requirement and I want to keep the files decoupled. But how do I bring a file to logical closure, i.e., how does my processor know that all the records of a file have been processed?
- Do I need to maintain that bookkeeping (correlation ID, number of records, etc.) in something like Ignite? I am unsure about that; a rough sketch of what I mean is below.
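To make the second point concrete, this is the kind of bookkeeping I am imagining, whether it lives in a Kafka Streams state store (as sketched here) or externally in something like Ignite. It assumes each file's expected row count is published to a separate `file-metadata` topic keyed by correlation ID; the topic names and the `COMPLETE` marker are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class FileCompletionTracker {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Count processed rows per correlation ID (the file identifier is the key).
        KTable<String, Long> processedCount = builder
                .stream("processed-rows", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.as("processed-count-store"));

        // Expected row count per file, published once when the file is split into rows.
        KTable<String, Long> expectedCount = builder
                .table("file-metadata", Consumed.with(Serdes.String(), Serdes.Long()));

        // When the processed count reaches the expected count, emit a completion event
        // that a downstream consumer can use to trigger output-file generation.
        processedCount
                .join(expectedCount, (processed, expected) -> processed.equals(expected))
                .toStream()
                .filter((correlationId, complete) -> Boolean.TRUE.equals(complete))
                .mapValues(complete -> "COMPLETE")
                .to("file-completed", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "file-completion-tracker");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder.build(), props).start();
    }
}
```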