1

I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy".

Is this a matter of using a regex and lifting the text between "#books" and "."?

What if there's no structure to the message, like: "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"? How can I reliably pull the phrase "War & Peace by Leo Tolstoy" without knowing the phrase in advance?

Are there any gems, methods, etc. that can help me do this?

At the very least, what would you call what I'm trying to do? It will help me search for a solution on Google. I've tried a few searches on "parsing" with no luck.

--- edit --- based on @rogeliog's suggestion, I will add the following:

I can live with the garbage text that comes after #books, but nothing before. I tried "match(/#books.*/)" -- results here: www.rubular.com/r/gM7oSZxF5M.

But how can I capture Result #6 (i.e., when someone puts #books at the end of the sentence)?

Is there a way for me to do an if-then with regex? Something like:

if [#books is at the end of the message],

then [take the last 10 words preceding #books],

else [match(/#books.*/)]

If you offer a regex, please post your solution via a permalink using rubular.com.

1
  • I think it is called Data Mining. Commented Jun 25, 2011 at 0:38

2 Answers

2

I think what you're going to need is Natural Language Processing. It's a very large field and has many techniques and applications. With Ruby in particular you may want to look at the Ruby Linguistics project.

Good luck to you; parsing and processing natural language is not an easy thing to do.


1 Comment

Thank you for the links. I'll take a look and see what I'm up against.
0

I think that you are trying to parse some pretty complex variations. Do you have a DB with all the book titles? That will help a lot.
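
To illustrate why a list of titles helps, here is a minimal sketch that assumes the titles are already loaded into an array. `KNOWN_TITLES` and `find_known_title` are made-up names for illustration, and the three titles stand in for whatever your DB actually holds.

```ruby
# Hypothetical list standing in for a real table of book titles.
KNOWN_TITLES = [
  "War & Peace",
  "Anna Karenina",
  "Crime and Punishment"
].freeze

# Returns the first known title found in the text (case-insensitive), or nil.
def find_known_title(text, titles = KNOWN_TITLES)
  haystack = text.downcase
  # Try longer titles first so a longer title wins over a shorter one it contains.
  titles.sort_by { |title| -title.length }.find do |title|
    haystack.include?(title.downcase)
  end
end

find_known_title("This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books")
# => "War & Peace"
```

With a lookup like this the position of #books no longer matters, but it only finds titles that are spelled the same way as in the list.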

To get the title out of the first example ("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!") you can simply do:

"This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!".match(/#books[^.]*\./).to_s.gsub("#books",'')

That will return: " War & Peace by Leo Tolstoy."
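
If the leading space and trailing period are unwanted, a capture group may be a slightly cleaner variant of the same idea. This is only a sketch: `title_before_period` is a made-up helper name, and it still assumes the title itself contains no period.

```ruby
# Hypothetical helper: returns the text between "#books" and the next period,
# or nil when the pattern does not match.
def title_before_period(text)
  # String#[] with a regex and a group index returns just that capture group.
  text[/#books\s+([^.]+)\./, 1]
end

title_before_period("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!")
# => "War & Peace by Leo Tolstoy"
```

A title such as "Dr. Zhivago" would be cut at the first period, so this only suits messages that follow the structured format.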

If you want an if/else statement depending on whether #books is at the end or not, you can do:

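if text.match(/#books\s*\z/)
  # #books ends the message: print up to 10 words before it
  puts text.match(/(\S+\s+){0,10}(?=#books\s*\z)/).to_s.strip
else
  # otherwise print whatever follows #books
  puts text.match(/#books.*/).to_s.sub("#books", '')
end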

That will give you the last 10 words preceding #books if #books is at the end, and whatever comes after #books if it is not at the end.
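
The same branching can be wrapped in a method so it returns a string instead of printing. This is a rough sketch, not the only way to do it: `extract_book_phrase` and its `max_words` parameter are made-up names, and `match?` assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: returns the text around "#books" that most likely
# holds the title, or nil when the message has no "#books" tag.
def extract_book_phrase(text, max_words: 10)
  if text.match?(/#books\s*\z/)
    # "#books" ends the message: keep up to max_words words before it.
    words = text.sub(/#books\s*\z/, "").split
    words.last(max_words).join(" ")
  elsif text.include?("#books")
    # Otherwise keep whatever follows the first "#books".
    text[/#books(.*)/m, 1].strip
  end
end

extract_book_phrase("This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books")
# => "I love the book War & Peace by Leo Tolstoy"
extract_book_phrase("This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!")
# => "War & Peace by Leo Tolstoy I love this book!"
```

Splitting on whitespace instead of using a fixed `{10}` repetition means short messages with fewer than 10 words before #books still return something.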

I don't really have a better idea. Hope that works for you, let me know :)

3 Comments

Nice. This might be good enough to get what I want. Let me give it a shot and I'll get back to you.
Your suggestion is pretty good. I can live with the garbage text that comes after #books, but nothing before. So I went with "match(/#books.*/)"; check out the results: rubular.com/r/gM7oSZxF5M. How can I capture Result #6 (i.e., when someone puts #books at the end of the sentence)? Is there some way for me to do an if-then with regex? Something like: "if [#books is at the end of the message], then [take the last 10 words preceding #books], else [match(/#books.*/)]". Please post your solution via a permalink using rubular.com.
Hello, this is the Rubular permalink for getting the last 10 words preceding #books when #books is at the end: link. I also shared the code with you. I don't know if there is a better way to approach this problem; if you do, please let me know :)
