3

I have a stream of sentences (tweets) and over 10 million names. I want to determine if a single sentence (tweet) contains mention of one of the 10 million names. I could compile regex for all the possible patterns but I would really like to know if there is an efficient algorithm to do that.

Thanks,

1
  • What do you mean: ten million regexes, or one regex with all all ten million names joined together into an alternation? Either way, it sounds like more fun than a human should be allowed. ;) But seriously, this is not a job for regexes. Commented Sep 22, 2012 at 18:25

4 Answers 4

3

You could build a trie (a prefix tree).

Sign up to request clarification or add additional context in comments.

Comments

3

You might try using Bloom filter. Demo here.

1 Comment

Thanks. This is interesting. Bloom fliter might help. I will try it now.
0

I don't think you need pattern matching at all, if you only seek for the occurence of a simple string (name). If you are actually aiming at twitter names -- are they not prefixed with an @ sign when mentioned in tweets? If so, at first just seek for the @ sign and proceed from there.

To check if the string after the @ is one of your 10 million strings, a prefix tree as proposed by ruakh is definitely a good idea .

1 Comment

Thanks. It is not always the case that they are prefixed with an @. Some brand names are not.
0

You could go about it from the other way round. As the sentence comes in, split it into tokens and build a RegEx Pattern for each token, something like ^token\s*. Compare each of those against your 10 million names assuming each on line.

1 Comment

Thanks but this involve chunking the sentence to detect nouns which is pretty expensive to do given millions of sentences. I hope I understood your suggestion correctly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.