
Currently I am using Scrapy to parse a large XML file from an FTP server into Elasticsearch. It works, but it feels like quite a heavyweight solution, and it uses a lot of memory too.

I am wondering if I would be better off writing a plugin for ES instead. I know Logstash can do this, but I can't do inline language detection etc. with it.

A) If I write an actual plugin for ES, I think it has to be in Java to pull in the data. Is there any advantage to this approach, or could I write a separate Python script to push the data in instead? Is there any clear reason for choosing one method over the other (assuming I know neither Java nor Python)?

This comes down to:

  • Would the memory management be better with an actual ES plugin?
  • Is Java better suited to processing XML than, say, Python?

1 Answer


Converting XML to JSON is really a question of understanding the actual data in the XML: the transformation can be non-trivial and usually needs additional logic. For this reason, there is no error-proof XML-to-JSON translator.

If you decide to do it in Python, take a look at ElementTree (xml.etree), lxml, and xmltodict. JSON support is in Python's stdlib natively.
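For instance, a minimal sketch of the xmltodict route (the sample XML and the field names in it are invented for illustration):

    import json
    import xmltodict

    xml = """
    <products>
        <product>
            <id>1</id>
            <name>Example</name>
        </product>
    </products>
    """

    # xmltodict.parse returns a dict mirroring the XML structure,
    # which json.dumps from the stdlib can serialize directly.
    doc = xmltodict.parse(xml)
    print(json.dumps(doc, indent=2))

Note that this loads the whole document into memory, so it suits moderately sized files; see the streaming sketch further down for the large-file case.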

If you decide to try your luck on the ES side, look at elasticsearch-xml. It may fit your needs if your XML is consistent.

Talking about Python vs. Java performance for parsing: if performance is key for you, you can lean on libraries that are already optimized at a low level (lxml, for instance, wraps the C library libxml2), but generally, good Java code should perform better.
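Either way, if memory is your main concern, streaming the parse matters more than the language choice. A sketch using lxml's iterparse, assuming a hypothetical file name and a repeating <record> element:

    from lxml import etree

    def iter_records(path):
        # iterparse yields each element as its closing tag is read,
        # so the full tree never has to sit in memory at once.
        for _, elem in etree.iterparse(path, tag="record"):
            yield {child.tag: child.text for child in elem}
            # Release the processed element and its finished siblings.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    for doc in iter_records("large_feed.xml"):
        print(doc)  # e.g. hand each dict off to an Elasticsearch bulk indexer

The clear/delete dance after each yield is what keeps memory flat: without it, lxml keeps every parsed element alive under the root.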
