I am using the libpostal library to find a full address (street, city, state, and postal code) within a news article. libpostal, when given the input text:

There was an accident at 5 Main Street Boulder, CO 10566 -- which is at the corner of Wilson.

returns a vector:

[{:label "house", :value "there was an accident at 5"}
 {:label "road", :value "main street"} 
 {:label "city", :value "boulder"}
 {:label "state", :value "co"}
 {:label "postcode", :value "10566"}
 {:label "road", :value "which is at the corner of wilson."}]

I am wondering if there is a clever way in Clojure to extract a subsequence whose :label values occur in this order:

[road unit? level? po_box? city state postcode? country?]

where ? represents an optional value in the match.

1 Answer

You could do this with clojure.spec. First define some specs that match your maps' :label values:

(require '[clojure.spec.alpha :as s])

(defn has-label? [m label] (= label (:label m)))
(s/def ::city #(has-label? % "city"))
(s/def ::postcode #(has-label? % "postcode"))
(s/def ::state #(has-label? % "state"))
(s/def ::house #(has-label? % "house"))
(s/def ::road #(has-label? % "road"))

Then define a regex spec, e.g. with s/cat and s/?:

(s/def ::valid-seq
  (s/cat :road ::road
         :city (s/? ::city) ;; ? = zero or once
         :state ::state
         :zip (s/? ::postcode)))
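
The same idea can be stretched to the full pattern from the question, [road unit? level? po_box? city state postcode? country?]. What follows is only a sketch: the spec name ::full-address-seq and the extra label specs are made up for illustration, and it assumes libpostal emits the labels "unit", "level", "po_box" and "country" exactly as spelled in the question. Note that city is required here, as in the question's pattern, whereas ::valid-seq above treats it as optional.

```clojure
(require '[clojure.spec.alpha :as s])

;; True when map m carries the given libpostal label.
(defn has-label? [m label] (= label (:label m)))

;; One predicate spec per label named in the question's pattern.
(s/def ::road     #(has-label? % "road"))
(s/def ::unit     #(has-label? % "unit"))
(s/def ::level    #(has-label? % "level"))
(s/def ::po-box   #(has-label? % "po_box"))
(s/def ::city     #(has-label? % "city"))
(s/def ::state    #(has-label? % "state"))
(s/def ::postcode #(has-label? % "postcode"))
(s/def ::country  #(has-label? % "country"))

;; Hypothetical spec name. Mirrors
;; [road unit? level? po_box? city state postcode? country?]
;; where each s/? marks an element that may appear zero or one time.
(s/def ::full-address-seq
  (s/cat :road     ::road
         :unit     (s/? ::unit)
         :level    (s/? ::level)
         :po-box   (s/? ::po-box)
         :city     ::city
         :state    ::state
         :postcode (s/? ::postcode)
         :country  (s/? ::country)))
```

With these definitions, (s/valid? ::full-address-seq [{:label "road" :value "Damen"} {:label "city" :value "Chicago"} {:label "state" :value "IL"}]) should be true, while the same sequence without the city element should not match.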

Now you can conform or valid?-ate your sequences:

(s/conform ::valid-seq [{:label "road" :value "Damen"}
                        {:label "city" :value "Chicago"}
                        {:label "state" :value "IL"}])
=>
{:road {:label "road", :value "Damen"},
 :city {:label "city", :value "Chicago"},
 :state {:label "state", :value "IL"}}
;; this is also valid, missing an optional value in the middle
(s/conform ::valid-seq [{:label "road" :value "Damen"}
                        {:label "state" :value "IL"}
                        {:label "postcode" :value "60622"}])
=>
{:road {:label "road", :value "Damen"},
 :state {:label "state", :value "IL"},
 :zip {:label "postcode", :value "60622"}}
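
For completeness, here is a small sketch of what a non-matching sequence looks like. It repeats the definitions above so it can be pasted into a fresh REPL; the names missing-state and with-prefix are just illustrative. A sequence fails when a required element (here the state) is absent, and also when it starts with something the spec does not expect, such as the leading "house" element in the libpostal output from the question.

```clojure
(require '[clojure.spec.alpha :as s])

(defn has-label? [m label] (= label (:label m)))
(s/def ::city     #(has-label? % "city"))
(s/def ::postcode #(has-label? % "postcode"))
(s/def ::state    #(has-label? % "state"))
(s/def ::road     #(has-label? % "road"))

(s/def ::valid-seq
  (s/cat :road  ::road
         :city  (s/? ::city)
         :state ::state
         :zip   (s/? ::postcode)))

;; Illustrative input: the required state element is missing.
(def missing-state
  [{:label "road" :value "Damen"}
   {:label "city" :value "Chicago"}])

;; Illustrative input: a leading "house" element comes before the road.
(def with-prefix
  [{:label "house" :value "there was an accident at 5"}
   {:label "road"  :value "main street"}
   {:label "state" :value "co"}])

(s/valid? ::valid-seq missing-state)              ;; false
(s/invalid? (s/conform ::valid-seq missing-state)) ;; true, conform returned ::s/invalid
(s/valid? ::valid-seq with-prefix)                ;; false

;; s/explain-str returns a human-readable description of the mismatch.
(println (s/explain-str ::valid-seq missing-state))
```

s/conform returns the keyword :clojure.spec.alpha/invalid on failure, so checking the result with s/invalid? is safer than testing it for truthiness.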

2 Comments

I really like this approach. As far as skipping through the text prior to the address, would you recommend just looping through the vector with rest until either it is valid or end is reached?
@frank if your inputs might have invalid prefixes then I suppose they could also have invalid suffixes. If "brute force" isn't too costly it's a fine starting point. You could also try adding wildcard prefix/suffix into your spec e.g. (s/cat :prefix (s/* map?) ... :suffix (s/* map?)) but I wouldn't expect great performance for large inputs. For large inputs you could be better off writing your own state machine/parser.
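
A rough sketch of the "brute force" option discussed in these comments, assuming the specs from the answer. The helper name find-address is hypothetical. It tries every start position and, for each one, every end position from longest to shortest, and returns the first window that conforms, so both an invalid prefix and an invalid suffix are skipped. The number of windows grows quadratically with the input length, which is why this is only reasonable for short token vectors.

```clojure
(require '[clojure.spec.alpha :as s])

(defn has-label? [m label] (= label (:label m)))
(s/def ::city     #(has-label? % "city"))
(s/def ::postcode #(has-label? % "postcode"))
(s/def ::state    #(has-label? % "state"))
(s/def ::road     #(has-label? % "road"))

(s/def ::valid-seq
  (s/cat :road  ::road
         :city  (s/? ::city)
         :state ::state
         :zip   (s/? ::postcode)))

(defn find-address
  "Hypothetical helper: returns the conformed value of the first
  contiguous window of tokens that matches ::valid-seq, or nil if no
  window matches. Earlier start positions win, and for a given start
  the longest matching window wins."
  [tokens]
  (let [tokens (vec tokens)
        n      (count tokens)]
    (first
     (for [start (range n)
           end   (range n start -1) ; n, n-1, ..., start+1
           :let  [result (s/conform ::valid-seq (subvec tokens start end))]
           :when (not (s/invalid? result))]
       result))))
```

Applied to the libpostal vector from the question, this should skip the leading "house" element and the trailing "road" element and return a map with :road, :city, :state and :zip entries.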
