1

First of all, sorry for my bad English.

I googled this question but did not find any good information about it.

I have a text with about 3 million words. I also have a list of words, and I need to find each word from that list in the text. I would appreciate ideas on how to do this in a reasonable amount of time.

Thanks for the help.

Best regards.

3
  • Will you need to search for words several times or just once? Commented Dec 9, 2011 at 12:19
  • Do you need to find all the instances of a specific word, or just decide if the word is in the text? Commented Dec 9, 2011 at 12:25
  • Sorry for the lack of detail in the post. I need to search for each word in the input text repeatedly. I'm reading the input text from a .txt file loaded with BufferedReader. Thanks in advance. Commented Dec 9, 2011 at 12:56

4 Answers

4

Have a look at Lucene: http://lucene.apache.org/java/docs/index.html


3 Comments

I have already used Lucene with Zend Framework in PHP, but only for a single search over a single text. Can it be used for several searches inside a text loaded with BufferedReader? Thanks.
Sure, it's possible. To get an idea, have a look at this example: javatechniques.com/blog/lucene-in-memory-text-search-example
Thanks again, I will test this example with a .txt file as input and an ArrayList with the words to match.
1

It would be very inefficient to search the text from the text file each time.

If memory is not a constraint, you can add each word to an ArrayList, sort it once, and then do a binary search with the

Collections.binarySearch() API
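
A minimal sketch of this idea, assuming the words of the text have already been split into a list. Collections.binarySearch() only gives correct results on a sorted list, so the sketch sorts once in the constructor. The class name SortedWordLookup and the sample words are hypothetical, and matching here is exact and case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedWordLookup {
    private final List<String> sortedWords;

    SortedWordLookup(List<String> words) {
        // Copy so the caller's list is untouched, then sort once up front.
        this.sortedWords = new ArrayList<>(words);
        Collections.sort(this.sortedWords);
    }

    boolean contains(String word) {
        // binarySearch returns an index >= 0 only when the key is present.
        return Collections.binarySearch(sortedWords, word) >= 0;
    }

    public static void main(String[] args) {
        List<String> textWords = Arrays.asList("the", "quick", "brown", "fox");
        SortedWordLookup lookup = new SortedWordLookup(textWords);
        System.out.println(lookup.contains("fox")); // true
        System.out.println(lookup.contains("dog")); // false
    }
}
```

Each lookup then costs O(log(N)) string comparisons. If only a yes/no answer is needed, a HashSet would likely serve as well and needs no sorting.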

4 Comments

That's a great idea too: load each word into the collection and use binary search. But I would need to convert the words to binary to match the words in the list, wouldn't I? Thanks.
@Rodrigo Ferrari: binary search has nothing to do with binary format. It is an efficient recursive algorithm that partitions the collection and searches on increasingly smaller sub-collections. It finds an element in O(log(N)) instead of O(N), but requires the collection to be sorted.
@Rodrigo Ferrari: Not needed. You can use normal string comparisons. As Tudor said, it's an efficient way to search sorted items.
Thanks, I will try this approach and the Lucene idea.
1

Check out these libraries: http://johannburkard.de/software/stringsearch/

1 Comment

Great one, I will study it during the software development. Thanks!
1

If you need to search for the words only once, then I don't think you can do better than just a linear search over the text.
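
For the search-once case, a rough sketch of a single linear pass: the word list goes into a HashSet and the text is streamed through a BufferedReader, as in the question. The class name SinglePassSearch, the method findPresent, and the tokenization rule (split on anything that is not a letter or digit, case-sensitive match) are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

class SinglePassSearch {
    // Returns the subset of wanted words that occur at least once in the text.
    static Set<String> findPresent(Reader text, Collection<String> wanted) {
        Set<String> wantedSet = new HashSet<>(wanted);
        Set<String> found = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(text)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed tokenization: split on runs of non-letter, non-digit characters.
                for (String token : line.split("[^\\p{L}\\p{N}]+")) {
                    if (wantedSet.contains(token)) {
                        found.add(token);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        Reader text = new StringReader("the quick brown fox\njumps over the lazy dog");
        Set<String> found = findPresent(text, Arrays.asList("fox", "cat", "dog"));
        System.out.println(found.contains("fox")); // true
        System.out.println(found.contains("cat")); // false
    }
}
```

For a real file, the StringReader would be replaced by a FileReader on the .txt file. The text is read once and never held in memory as a whole.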

If you need to do several searches, then you will need to index your text and maybe use something like Lucene.
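
To illustrate what indexing means here, the following is a hand-rolled in-memory inverted index, not Lucene and not its API. It maps each word to the positions where it occurs, so every later search is a single hash lookup. The class name WordIndex, its methods, and the tokenization rule are hypothetical choices for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordIndex {
    // word -> zero-based positions (word offsets) at which it occurs in the text
    private final Map<String, List<Integer>> positions = new HashMap<>();

    // Reads the text once and records the position of every word.
    static WordIndex build(Reader source) {
        WordIndex index = new WordIndex();
        int position = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed tokenization: split on runs of non-letter, non-digit characters.
                for (String token : line.split("[^\\p{L}\\p{N}]+")) {
                    if (token.isEmpty()) {
                        continue; // split yields "" when a line starts with a separator
                    }
                    index.positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
                    position++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }

    boolean contains(String word) {
        return positions.containsKey(word);
    }

    List<Integer> positionsOf(String word) {
        return positions.getOrDefault(word, Collections.emptyList());
    }

    public static void main(String[] args) {
        WordIndex index = WordIndex.build(new StringReader("to be or not to be"));
        System.out.println(index.positionsOf("be")); // [1, 5]
        System.out.println(index.contains("fox"));   // false
    }
}
```

Building the index costs one pass over the text. After that, each of the many searches runs in roughly constant time, which is the trade-off that makes indexing worthwhile for repeated queries.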

1 Comment

Yeah, I need to search the text several times. 50,000 searches is a small run on a text with 3 million words. Thanks.
