3

I have a large data set of URLs and I need a way to parse words out of the URLs, e.g.:

realestatesales.com -> {"real","estate","sales"}

I would prefer to do it in Python. This seems like it should be possible with some kind of English-language dictionary. There might be some ambiguous cases, but I feel like there should be a solution out there somewhere.

  • What about words that are NOT in the dictionary, e.g. imgur.com? Commented Jun 13, 2013 at 17:28
  • More ambiguity than you would think... realestatesales Commented Jun 13, 2013 at 17:29
  • I can deal with some ambiguity. Maybe a good system would come up with the most likely parsing. Commented Jun 13, 2013 at 17:32

3 Answers

4

A ternary search tree, filled with a word dictionary, can find the most complete set of matched terms (words) rather efficiently. This is the solution I've used previously.
You can get a C/Python implementation of a TST here: http://github.com/nlehuen/pytst

Example:

import tst

tree = tst.TST()
# Fill the tree with your dictionary words first; an empty tree matches nothing
# (dict-style insertion such as tree["estate"] = "estate" should work; check the pytst docs).
# tst.ListAction() collects each matched term into a list.
words = tree.scan("MultipleWordString", tst.ListAction())
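
For a picture of what the data structure itself is doing, here is a minimal, self-contained ternary search tree in plain Python (an illustration of the technique, not the pytst API). matches_at() returns every dictionary word that begins at a given position in the string, which is the primitive a segmenter builds on; the toy dictionary is an assumption.

# Minimal, self-contained ternary search tree: an illustration of the idea,
# not the pytst API. A few assumed dictionary words are inserted, then
# matches_at() reports every dictionary word starting at a given position.

class Node:
    __slots__ = ("ch", "left", "eq", "right", "is_word")
    def __init__(self, ch):
        self.ch = ch
        self.left = self.eq = self.right = None
        self.is_word = False

class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.left = self._insert(node.left, word, i)
        elif ch > node.ch:
            node.right = self._insert(node.right, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def matches_at(self, text, start):
        # Walk the tree along text[start:], collecting every complete word found.
        found, node, i = [], self.root, start
        while node is not None and i < len(text):
            ch = text[i]
            if ch < node.ch:
                node = node.left
            elif ch > node.ch:
                node = node.right
            else:
                if node.is_word:
                    found.append(text[start:i + 1])
                node = node.eq
                i += 1
        return found

tree = TernarySearchTree()
for w in ("real", "estate", "sales"):
    tree.insert(w)
print(tree.matches_at("realestatesales", 0))   # ['real']
print(tree.matches_at("realestatesales", 4))   # ['estate']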

Other Resources:

The open-source search engine Solr uses what it calls a "Word-Boundary-Filter" to deal with this problem; you might want to have a look at it.


2

This is a word segmentation problem, and an efficient dynamic programming solution exists. This page discusses how you could implement it; a rough sketch is below. I have also answered this question on SO before, but I can't find a link to that answer. Please feel free to edit my post if you do.
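
As a sketch of the idea (the word list and the scoring here are assumptions for illustration, not the linked page's implementation): best[i] holds the best segmentation found for the first i characters, and each position is extended by every dictionary word ending there, preferring segmentations with fewer words.

# Sketch of dictionary-based word segmentation via dynamic programming.
# WORDS is an assumed toy dictionary; a real system would load a large word
# list and weight candidates by word frequency to resolve ambiguous cases.

WORDS = {"real", "estate", "estates", "sale", "sales", "ales"}
MAX_WORD_LEN = max(len(w) for w in WORDS)

def segment(text):
    # best[i] = best list of words covering text[:i], or None if no split exists
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            word = text[j:i]
            if word in WORDS and best[j] is not None:
                candidate = best[j] + [word]
                # prefer the segmentation with the fewest (hence longest) words
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(text)]

print(segment("realestatesales"))   # ['real', 'estate', 'sales']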


2

This might be of use to you: http://www.clips.ua.ac.be/pattern

It's a set of modules which, depending on your system, might already be installed. It does all kinds of interesting stuff, and even if it doesn't do exactly what you need it might get you started on the right path.

2 Comments

Any particular module you would recommend?
Off the top of my head, you could combine a wordlist with a generated synset from WordNet using the respective modules from here: clips.ua.ac.be/pages/pattern-en and then search for substrings (words) potentially included in your superstrings (URLs). This method would not be time-efficient, but the concepts it requires might help you find a better solution.
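
A rough sketch of the brute-force substring idea from the comment above, assuming a plain one-word-per-line word list (the /usr/share/dict/words path is just an assumption, and the WordNet filtering step is not shown):

# Brute-force version of the comment above: check every substring of the host
# against a word list. There are O(n^2) substrings per host, so this is not
# time-efficient, but it is easy to reason about.

def load_wordlist(path="/usr/share/dict/words", min_len=3):
    # Assumed word list location; any one-word-per-line file will do.
    with open(path) as f:
        return {line.strip().lower() for line in f if len(line.strip()) >= min_len}

def dictionary_substrings(host, words):
    # Return every dictionary word that appears anywhere inside the host name.
    found = set()
    for start in range(len(host)):
        for end in range(start + 1, len(host) + 1):
            if host[start:end] in words:
                found.add(host[start:end])
    return found

words = load_wordlist()
print(dictionary_substrings("realestatesales", words))
# e.g. {'ales', 'estate', 'real', 'sale', 'sales', ...}; a real solution still needs scoring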
