
I have the following problem: I need to store a lot of XML strings of variable length and structure. As is typical with XML, many substrings are identical (certain element, attribute, and value combinations). Often the whole document is the same except for a few small parts.

Because I have a lot of these strings, I need to store the XML structures as size-efficiently as possible.

One idea I had is replacing often-occurring strings with variables. Let's say attribute=verylongvalueetcetc appears in a lot of XML structures; my idea would be to replace that string with a placeholder like #1# (which would save 26 chars), and then substitute it back when I need it again.
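The placeholder idea above can be sketched in a few lines. This is a minimal illustration, not a robust implementation: the substitution table, the token format `#1#`, and the sample XML are all made up, and it assumes the tokens never occur naturally in the input (otherwise expansion would corrupt the data).

```python
def pack(text, table):
    """Replace each frequently occurring substring with its short token."""
    for token, substring in table.items():
        text = text.replace(substring, token)
    return text

def unpack(text, table):
    """Inverse operation: substitute the tokens back."""
    for token, substring in table.items():
        text = text.replace(token, substring)
    return text

# Hypothetical substitution table; in practice it would be built from
# frequency analysis of the stored documents.
TABLE = {"#1#": 'attribute="verylongvalueetcetc"'}

xml = '<node attribute="verylongvalueetcetc">data</node>'
packed = pack(xml, TABLE)       # '<node #1#>data</node>'
assert unpack(packed, TABLE) == xml
```

Note that the table must be stored alongside the data and versioned carefully: changing it later would silently break expansion of previously stored documents.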

Can anybody think of better ways or methods?

//Edit: In the end, I want to save the strings in several rows in MongoDB

  • You are reinventing compression. Don't try to write a clever space-saving routine - take advantage of one of the many, many space-saving routines that already exist. It doesn't matter whether you use a compressed file system, a zip archive or any other method. Commented Apr 26, 2016 at 13:39
  • But does this work in a database like MongoDB? Commented Apr 26, 2016 at 13:47
  • Yes. mongodb.com/blog/post/new-compression-options-mongodb-30 Commented Apr 26, 2016 at 14:05
  • "I need to save the xml structures as size-efficient as possible" - why/what is your goal here? Just to save disk storage? Or do you want to have the data in a compressed form to minimize network traffic between your database server and the clients? What order of magnitude do you have in mind for the data? Commented Apr 26, 2016 at 19:22

1 Answer


This is essentially how many compression algorithms work (if you're interested, here's a detailed explanation of GZIP), which is also why they are so efficient at compressing text in general and XML specifically.
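To see this in practice, here is a small sketch using Python's standard-library gzip module on a repetitive XML string (the sample document is made up; real ratios depend on your data):

```python
import gzip

# Build a deliberately repetitive XML document, similar in spirit to the
# question: the same attribute/value combination repeated many times.
xml = ("<items>"
       + '<item attribute="verylongvalueetcetc">x</item>' * 100
       + "</items>").encode("utf-8")

compressed = gzip.compress(xml)

# Highly repetitive XML compresses far better than any hand-rolled
# placeholder scheme, because GZIP finds *all* repeated substrings.
print(f"raw={len(xml)} bytes, gzipped={len(compressed)} bytes")
assert len(compressed) < len(xml)
assert gzip.decompress(compressed) == xml
```

The dictionary-building that GZIP's DEFLATE stage performs is, in effect, an automated and far more thorough version of the manual placeholder substitution proposed in the question.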

If I were you, I would start by asking myself the following questions:

  • Is the size of the data actually important? At $0.03 per GB for extremely reliable storage, keeping a few gigabytes of data is extremely cheap.

  • If the data size actually matters (for instance, if we are talking about storing thousands of terabytes of data; or we only have a few megabytes but need to transfer all of it regularly over a slow connection; or we have a few kilobytes to store on embedded hardware with limited memory), is JSON an option? Would the benefit be high enough?

  • Since you mentioned MongoDB, I don't understand why you are using XML in the first place (I imagine that you are storing it as a BLOB in a document in MongoDB). If you have an object you want to store in MongoDB, don't serialize it to XML in order to store it as a BLOB. Send it as-is to MongoDB and let MongoDB handle the job of storing the data efficiently.

  • Independently of the answer to the previous question, what about using ordinary compression? In most languages/frameworks, using an existing compression algorithm is straightforward. The only problem is the CPU load if the data actually lives on embedded hardware with a very, very slow CPU (I expect this to be irrelevant for devices such as smartphones, tablets, desktop computers and servers). You then need to benchmark both approaches and find which one is better for you.
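The last point, benchmarking size saving against CPU cost, can be sketched as follows. This uses the stdlib zlib module and a made-up repetitive document; the compression level 6 is zlib's default trade-off, and real measurements should of course use your actual data:

```python
import time
import zlib

# Hypothetical repetitive XML payload standing in for the real data.
doc = ("<root>"
       + '<entry name="config" value="somevalue"/>' * 500
       + "</root>").encode("utf-8")

start = time.perf_counter()
packed = zlib.compress(doc, level=6)  # level 1 = fastest, 9 = smallest
elapsed = time.perf_counter() - start

# Verify the round trip, then compare bytes saved against CPU time spent.
assert zlib.decompress(packed) == doc
print(f"raw={len(doc)} compressed={len(packed)} time={elapsed:.6f}s")
```

Running this at different compression levels (and on representative documents) tells you whether the CPU cost is acceptable for the storage saved; note also that MongoDB's WiredTiger storage engine can apply block compression (snappy or zlib) transparently, which may make application-level compression unnecessary.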
