Currently I am using Scrapy to parse a large XML file from an FTP server into Elasticsearch. It works, but it seems like quite a heavyweight solution and it uses a lot of memory too.
I am wondering whether I would be better off writing a plugin for ES instead. I know Logstash can do the import, but I can't do inline language detection etc. with that.
If I write an actual plugin for ES, I think it has to be in Java to pull in the data. Is there any advantage to this approach, or could I write a separate Python script to push the data in instead? Is there any clear reason for selecting one method over the other (assuming I know neither Java nor Python)?
This comes down to:
- Would the memory management be better with an actual ES plugin?
- Is Java better suited to processing XML than, say, Python?
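To make the "separate Python script" option concrete, here is a minimal sketch of what it might look like. It uses only the standard library and parses the XML incrementally instead of loading it all at once. The record tag `item`, the index name `docs`, and the function name `iter_actions` are hypothetical placeholders, not taken from my real feed.

```python
import xml.etree.ElementTree as ET


def iter_actions(source, record_tag="item", index="docs"):
    """Yield one bulk-style action dict per <record_tag> element.

    `record_tag` and `index` are hypothetical; adjust them to the real feed.
    `source` is a filename or a binary file object.
    """
    # iterparse reads the input incrementally and reports each element
    # once its closing tag has been seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        # Flatten the record's direct children into a simple dict.
        # Per-record work such as language detection could happen here.
        doc = {child.tag: (child.text or "").strip() for child in elem}
        yield {"_index": index, "_source": doc}
        # Release the record's children and text. The emptied element
        # itself stays attached to its parent, so memory still grows
        # slightly with the number of records.
        elem.clear()
```

The yielded dicts use the `_index`/`_source` shape that `elasticsearch.helpers.bulk` accepts, so the generator could presumably be passed to it directly. I have not measured how this compares with the Scrapy pipeline in practice.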