Serialize persistent data structures in clojure

Question

We all know that Rich uses a ideal hash tree-based method to implement the persistent data structures in Clojure. This structure enables us to manipulate the persistent data structures without copying a lot.

But it seems I cannot find the correct way to serialize this specific structure. For example given:

(def foo {:a :b :c :d})
(def bar (assoc foo :e :f))
(def bunny {:foo foo :bar bar})

My question is:

How can I serialize the bunny such that the contents of foo, i.e. :a mapping to :b and :c mapping to :d, appear only once in the serialized content? It's like dumping a memory image of the structures. It's also like serializing the "internal nodes" as well as the "leaf nodes" referenced here.

P.S. In case this is relevant, I am building a big DAG (directed acyclic graph) where we assoc quite a bit to link these nodes to those nodes, and want to serialize the DAG for later de-serialization. The expanded representation of the graph (i.e., the content one'll get when printing the DAG in repl) is unacceptably long.

you mean the keys :a and :c and their according values? or really just the element that :a and :b are? — cfrick
– cfrick, Commented Jul 3, 2015 at 10:51
@cfrick Thanks for editing. The full contents of foo should be reused in the construction of bar, and I want the serialization to include the contents of foo only once. I'll edit the question again to clarify. — Davyzhu
– Davyzhu, Commented Jul 3, 2015 at 11:15
It would be good if you can specify also your expectation - how the serialized result of bunny should look like. For me (str bunny) results in {:foo {:c :d, :a :b}, :bar {:e :f, :c :d, :a :b}}, which seems correct. But I don't understand what do you mean by "include contents of foo only once. — Viktor K.
– Viktor K., Commented Jul 3, 2015 at 11:41
@ViktorK. I'm not specific to any implementation as long as I don't have to fully realize the DAG. Like @Frank has said in the answer below, I expect an automatic tokenization like the one used to implement PersistentMap or PersistentVector. — Davyzhu
– Davyzhu, Commented Jul 4, 2015 at 4:53

Frank C. · Accepted Answer · 2015-07-04 10:40:41Z

1

Davyzhu,

Few things first:

The DAG, without tokenization strategy, will be as long as the DAG is. If foo is referenced 1 or more times each will be fully realized (i.e. displayed) in turn during printing.
For the interchanges of the information (serialize and deserialize) it will be largely dependent on your goals. For example, if you are serializing to send it off over the wire you will either want to do it fully (like the printed representation) or you will need to encode individual data points with some identification/tokenization strategy. The latter, of course, assumes the receiving end can deserialize with understanding of the tokenization protocol.
The tokenization strategy example, could use Clojure meta facilities perhaps, would require encoding unique keys for each content block reference and your DAG contains nodes where the edges are represented by the keys.

Edit:: Modified since original post to clarify as per comments but the example does not reflect the hierarchical nature of the DAG.

A contrived example:

(def node1 {:a :b :c :d})
(def node2 {:e :f})
(def dictionary {:foo node1 :bar node2})

(def DAG [:bunny [:foo :bar]])

(println DAG) ; => [:bunny [:foo :bar]]

(defn expand-dag1
  [x]
  (if (keyword? x)
    (get dictionary x x)
    x))

(println (w/postwalk expand-dag1 DAG)) ; => [:bunny [{:a :b, :c :d} {:e :f}]]

Note: Use of vectors, maps, lists, etc. to express your DAG is up to you.

edited Jul 4, 2015 at 10:40

answered Jul 3, 2015 at 13:03

Frank C.

8,1854 gold badges40 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Davyzhu Over a year ago

Can you check again your contrived example? I found it a bit hard to follow. Why the [:node foo] appear twice in it? We use the source node(s) to represent the DAG, and the rest are referenced indirectly. And, further, are you saying I have to implement the tokenization protocol myself even though this has already be done behind the scene?

Frank C. Over a year ago

Davyzhu: Hope the edits help, I mapped more to your example to reflect some structure of the DAG. I'm using vectors based on my experience with doing the same with Ontologies.

Davyzhu Over a year ago

I see. You're showing how to tokenize the graph and differentiate token and content. I'll try traverse the graph and do this as well. Thanks!

johnbakers · Accepted Answer · 2015-07-04 09:29:25Z

1

This is one option (that will not work in Clojurescript, in case that matters) and in general may be perceived as a bad idea, but it is worth mentioning anyway.

If I understand your question, you want the foo in (def bunny {:foo foo :bar bar}) to not be "pasted" as a full copy, but rather retain a "reference" to the (def foo..) such that the original foo map is only serialized once.

One technique I would consider though not necessarily encourage (and only after exhausting other options such as a reorginization of your data structure as hinted by Frank C.), is to serialize the code for bunny rather than the structure itself. Then you read the code string back in and eval it. This would only work if the structure for bunny does not change, or if it does, that you can easily build a string of the bunny map with the relevant symbols included as part of the string, rather than the contents of those symbols.
But a much better idea would be to serialize your "raw" data structures only, like the maps foo and bar, then build your bunny after these are read back in -- by also serializing the structure but not the contents of bunny. I believe this is what Frank's answer is getting at.

Worth noting that if the structure of bunny does change dynamically, and you are able to create a string of symbols as suggested in 1. above, then that means you also have the tools to instead build a representation of bunny as in 2. above, which would be preferable.

Since code is data, option 1. is an example of the type of flexibility available to us as lisp programmers -- but that doesn't mean there are not better options.

answered Jul 4, 2015 at 9:29

johnbakers

24.9k25 gold badges143 silver badges272 bronze badges

2 Comments

Frank C. Over a year ago

hellofunk: I was thinking along similar lines. In the recent edit I added a '''dictionary''' which in the real world would be managed by an atom to be most effective.

Davyzhu Over a year ago

This is interesting! But unfortunately, the code as well as the data used to generate the graph is even more complex and computation intensive. Thanks for answering!

Collectives™ on Stack Overflow

Serialize persistent data structures in clojure

2 Answers 2

3 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related