Efficient string matching in Java

Question

I have a stream of sentences (tweets) and over 10 million names. I want to determine whether a single sentence (tweet) contains a mention of one of the 10 million names. I could compile a regex for all the possible patterns, but I would really like to know if there is an efficient algorithm to do that.

Thanks,

What do you mean: ten million regexes, or one regex with all ten million names joined together into an alternation? Either way, it sounds like more fun than a human should be allowed. ;) But seriously, this is not a job for regexes.
– Alan Moore, Commented Sep 22, 2012 at 18:25

ruakh · Accepted Answer · 2012-09-22 16:21:49Z

3

You could build a trie (a prefix tree).
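As a rough illustration of the idea, here is a minimal character-level trie sketch. The class name `NameTrie` and its methods are hypothetical, and the sketch assumes exact, case-sensitive matching with no word-boundary checks, so a name embedded inside a longer word would also be reported. With 10 million names, a map per node may be too memory-hungry, so a more compact node layout would probably be needed in practice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a character-level trie over the known names.
class NameTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a name ends at this node
    }

    private final Node root = new Node();

    /** Adds one name to the trie. */
    void add(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.computeIfAbsent(name.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Returns the first name found in the text, or null if none occurs. */
    String findFirst(String text) {
        // Try every start position and walk the trie as far as the text allows.
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int end = start; end < text.length(); end++) {
                node = node.children.get(text.charAt(end));
                if (node == null) {
                    break; // no name continues with this character
                }
                if (node.terminal) {
                    return text.substring(start, end + 1);
                }
            }
        }
        return null;
    }
}
```

Each tweet is scanned once per start position, and the inner walk is bounded by the length of the longest name, so the cost per tweet does not grow with the number of names.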

answered Sep 22, 2012 at 16:21

ruakh


mindas · Accepted Answer · 2012-09-22 16:26:30Z

3

You might try using a Bloom filter. Demo here.
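To make the suggestion concrete, here is a small hand-rolled Bloom filter sketch. The class `NameBloomFilter`, its parameters, and the FNV-1a-style second hash are assumptions for illustration, not taken from the linked demo. A Bloom filter only answers "is this candidate string possibly in the set?", so the tweet would still have to be broken into candidate strings (for example, tokens), and any positive answer should be confirmed against the exact name list because false positives are possible.

```java
import java.util.BitSet;

// Hypothetical helper: a fixed-size Bloom filter using double hashing.
class NameBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per name

    NameBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records a name by setting hashCount bit positions. */
    void add(String name) {
        int h1 = name.hashCode();
        int h2 = secondHash(name);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    /** Returns false if the name was definitely never added, true if it may have been. */
    boolean mightContain(String name) {
        int h1 = name.hashCode();
        int h2 = secondHash(name);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, size);
    }

    // FNV-1a-style hash over the string's chars, used as the second base hash.
    private static int secondHash(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }
}
```

The bit count and the number of hash functions determine the false-positive rate, so for 10 million names they would need to be sized deliberately rather than guessed.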

answered Sep 22, 2012 at 16:26

mindas


1 Comment

DotNet Over a year ago

Thanks. This is interesting. A Bloom filter might help. I will try it now.

Ridcully · Accepted Answer · 2012-09-22 16:31:15Z

0

I don't think you need pattern matching at all if you are only looking for the occurrence of a simple string (a name). If you are actually targeting Twitter names, are they not prefixed with an @ sign when mentioned in tweets? If so, first look for the @ sign and proceed from there.

To check whether the string after the @ is one of your 10 million strings, a prefix tree as proposed by ruakh is definitely a good idea.
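A minimal sketch of the "find the @ first" approach follows. The class `MentionFinder` is hypothetical, the rule that a handle consists of letters, digits, and underscores is an assumption, and a plain `Set` stands in for the prefix tree to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: extracts @handles and keeps those present in a known set.
class MentionFinder {
    static List<String> knownMentions(String tweet, Set<String> names) {
        List<String> found = new ArrayList<>();
        int at = tweet.indexOf('@');
        while (at >= 0) {
            // Consume the handle characters that follow the '@'.
            int end = at + 1;
            while (end < tweet.length() && isHandleChar(tweet.charAt(end))) {
                end++;
            }
            String handle = tweet.substring(at + 1, end);
            if (!handle.isEmpty() && names.contains(handle)) {
                found.add(handle);
            }
            at = tweet.indexOf('@', end);
        }
        return found;
    }

    // Assumption: handles consist of letters, digits, and underscores.
    private static boolean isHandleChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }
}
```

This only inspects text that follows an @ sign, so names mentioned without the prefix would be missed.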

answered Sep 22, 2012 at 16:31

Ridcully


1 Comment

DotNet Over a year ago

Thanks. It is not always the case that they are prefixed with an @. Some brand names are not.

milam · Accepted Answer · 2012-09-22 16:31:24Z

0

You could approach it the other way round. As each sentence comes in, split it into tokens and build a regex Pattern for each token, something like ^token\s*. Compare each of those against your 10 million names, assuming each name is on its own line.
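One possible reading of this suggestion, sketched with a hypothetical `TokenRegexMatcher` class: tokens are split on whitespace (an assumption) and quoted so that regex metacharacters in a tweet are treated literally. Note that `^token\s*` matches any name that starts with the token, and that every token triggers a full pass over the name list, which is likely to be slow with 10 million names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: builds one pattern per token and scans the name list with it.
class TokenRegexMatcher {
    static List<String> matchingNames(String sentence, List<String> names) {
        List<String> hits = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Equivalent to ^token\s* with the token taken literally.
            Pattern p = Pattern.compile("^" + Pattern.quote(token) + "\\s*");
            for (String name : names) {
                if (p.matcher(name).find()) {
                    hits.add(name);
                }
            }
        }
        return hits;
    }
}
```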

answered Sep 22, 2012 at 16:31

milam


1 Comment

DotNet Over a year ago

Thanks, but this involves chunking the sentence to detect nouns, which is pretty expensive given millions of sentences. I hope I understood your suggestion correctly.
