Sorry if this has already been asked.

  • I have about 1 million text documents contained in psql
  • I am trying to see if they contain certain words, for example cancer, died or heart_attack. The list of words is quite long.
  • A document only needs to contain one of the words.
  • If a document contains one of the words, I copy it to a different folder.

My current code is:

  require "fileutils"

  directory = "disease"         # Name of the output directory
  FileUtils.mkpath(directory)   # Creates the directory if it doesn't exist

  cancer = Eightk.where("text ilike '%cancer%'")
  died = Eightk.where("text ilike '%died%'")

  cancer.each do |filing|
    filename = "#{directory}/#{filing.doc_id}.html"
    File.open(filename, "w") { |file| file.puts filing.text }
    puts "Storing #{filing.doc_id}..."
  end

  died.each do |filing|
    filename = "#{directory}/#{filing.doc_id}.html"
    File.open(filename, "w") { |file| file.puts filing.text }
    puts "Storing #{filing.doc_id}..."
  end

But this does not work well, for the following reasons:

  • It doesn't match the exact word (for example, '%died%' also matches "studied").

  • It is very time consuming, since it involves a lot of copying the same code and changing just one word.

So I have tried using Regexp.union as follows, but I am a bit lost:

    directory = "disease"           # Name of the output directory
    FileUtils.mkpath(directory)     # Creates the directory if it doesn't exist

    keywords = [/dead/, /killed/, /cancer/]

    re = Regexp.union(keywords)

So I am trying to search the text files for these keywords and then copy the text documents.

Any help is really appreciated.

1 Answer

Since you said:

I have about 1 million text documents contained in psql

and you use the ILIKE operator to search for words in those documents.

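IMHO, that is an inefficient implementation because your data is huge. With a leading-wildcard pattern such as '%cancer%', PostgreSQL cannot use an ordinary B-tree index, so every query has to scan all 1 million documents, and you run one such query per keyword. It will be very slow.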

Before moving forward, I think you should take a look at PostgreSQL full text search, if you simply want to use what is built into PG. You could also look at other products that are dedicated to the text search problem, such as Elasticsearch or Solr.
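To give an idea of what the built-in route could look like, here is a minimal sketch that turns the keyword list into a single `to_tsquery` string, so that one query covers all words. The helper name `build_tsquery` is hypothetical, and the commented-out `Eightk.where` call assumes the model and `text` column from the question; I have not run it against your schema.

```ruby
# Hypothetical helper (not part of any gem): builds a to_tsquery string
# that matches a document containing ANY of the keywords.
# Multi-word keywords such as "heart_attack" become phrase searches,
# which use the <-> operator (PostgreSQL 9.6 or later).
def build_tsquery(keywords)
  terms = keywords.map do |keyword|
    # Keep only letters and digits so no tsquery operators slip in.
    words = keyword.downcase.scan(/[a-z0-9]+/)
    next if words.empty?
    words.size > 1 ? "(#{words.join(' <-> ')})" : words.first
  end
  terms.compact.join(" | ")
end

query = build_tsquery(["cancer", "died", "heart_attack"])
puts query
# prints: cancer | died | (heart <-> attack)

# Assumed usage with the model from the question (needs Rails and a database):
#   Eightk.where("to_tsvector('simple', text) @@ to_tsquery('simple', ?)", query)
```

The 'simple' configuration is used here on the assumption that you want exact words; 'english' would also apply stemming. Note that this only becomes fast once there is a GIN index on the same `to_tsvector` expression, otherwise PostgreSQL still has to process every row.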

For PG full text search in Ruby, you could use the pg_search gem. If you use Rails, I also wrote a post about a simple full text search implementation with PG in Rails.

I hope you may find this useful.
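As for the Regexp.union attempt in the question: if you do want to filter in Ruby, here is a minimal sketch of how one whole-word pattern could replace the copied loops. `Filing` is only a stand-in Struct for the `Eightk` model so the example runs without a database, and `keyword_pattern` and `store_matches` are hypothetical names.

```ruby
require "fileutils"
require "tmpdir"

# Stand-in for the Eightk model so the sketch runs without a database.
Filing = Struct.new(:doc_id, :text)

# Builds one case-insensitive pattern that matches any keyword as a whole word.
# Regexp.union escapes plain strings, and \b keeps "died" from matching "studied".
def keyword_pattern(keywords)
  /\b(?:#{Regexp.union(keywords).source})\b/i
end

# Writes every matching filing to the directory and returns the paths written.
def store_matches(filings, directory, pattern)
  FileUtils.mkpath(directory)
  filings.select { |filing| filing.text.match?(pattern) }.map do |filing|
    path = File.join(directory, "#{filing.doc_id}.html")
    File.write(path, filing.text)
    path
  end
end

pattern = keyword_pattern(%w[dead killed cancer died])
filings = [
  Filing.new("a1", "The CEO died on Monday."),
  Filing.new("b2", "She studied oncology."),        # "died" only inside "studied"
  Filing.new("c3", "Diagnosed with CANCER in 2009.")
]

Dir.mktmpdir do |dir|
  written = store_matches(filings, File.join(dir, "disease"), pattern)
  puts written.map { |path| File.basename(path) }.inspect
  # prints: ["a1.html", "c3.html"]
end
```

With the real model you would probably pass something like `Eightk.find_each` instead of an array, so that records are loaded in batches. This still reads every document out of the database, so it is the simple route rather than the fast one.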
