Parsing / Extracting Text from String in Rails?

Question

I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy".

Is this a matter of using a regex and lifting the text between "#books" and "."?

What if there's no structure to the message, like: "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"? How can I reliably pull the phrase "War & Peace by Leo Tolstoy" without knowing the phrase ex ante?

Are there any gems, methods, etc. that can help me do this?

At the very least, what would you call what I'm trying to do? It will help me search for a solution on Google. I've tried a few searches on "parsing" with no luck.

--- edit --- based on @rogeliog's suggestion, I will add the following:

I can live with the garbage text that comes after #books, but nothing before. I tried "match(/#books.*/)" -- results here: www.rubular.com/r/gM7oSZxF5M.

But how can I capture Result #6 (i.e., when someone puts #books at the end of the sentence)?

Is there a way for me to do an if-then with regex? Something like:

if [#books is at the end of the message],

then [take the last 10 words preceding #books],

else [match(/#books.*/)]

If you offer a regex, please post your solution via a permalink using rubular.com.

I think it is called Data Mining.

– bassneck, Commented Jun 25, 2011 at 0:38

cbley · Accepted Answer · 2011-06-25 00:44:55Z

2

I think what you're going to need is Natural Language Processing. It's a very large field and has many techniques and applications. With Ruby in particular you may want to look at the Ruby Linguistics project.

Good luck to you; parsing and processing natural language is not an easy thing to do.

answered Jun 25, 2011 at 0:44

1 Comment

MorningHacker Over a year ago

Thank you for the links. I'll take a look and see what I'm up against.

rogeliog · Accepted Answer · 2011-06-25 07:43:59Z

0

I think that you are trying to parse some pretty complex variations. Do you have a DB with all the book titles? That would help a lot.
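
If such a list of titles is available, the lookup could be sketched roughly as below. This is only an illustration: `KNOWN_TITLES` and `find_titles` are hypothetical names, and a real application would presumably load the titles from its database rather than hard-code them.

```ruby
# Hypothetical list of known titles; in practice this would come from a database.
KNOWN_TITLES = [
  "War & Peace",
  "Anna Karenina"
].freeze

# Returns every known title that appears in the text (case-insensitive).
def find_titles(text, titles = KNOWN_TITLES)
  titles.select { |title| text.downcase.include?(title.downcase) }
end

tweet = "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"
p find_titles(tweet)
```

With a lookup like this, the position of #books in the message no longer matters, but only titles already in the list can be found.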

To get the title out of the first example ("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!"), you can simply do:

"This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book".match(/#books.*?\./).to_s.gsub("#books", '')

That will return: " War & Peace by Leo Tolstoy."
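
The same extraction can also be written with a capture group, which avoids the separate gsub call and the leading space. This is a sketch only: `title_after_tag` is a hypothetical helper name, and it assumes the phrase ends at the first period after #books.

```ruby
# Hypothetical helper: returns the text between "#books" and the next period,
# or nil when the tag or the closing period is missing.
def title_after_tag(text)
  captured = text[/#books\s*(.*?)\./, 1]
  captured && captured.strip
end

tweet = "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!"
p title_after_tag(tweet)
```

For a message with no period after #books, this returns nil, so the caller still needs a fallback for the unstructured cases.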

If you want an if/else that depends on whether #books is at the end or not, you can do:

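# `text` holds the message
if text.match(/#books\s*\z/)
  # #books is at the end: take up to the last 10 words before it
  puts text[/((?:\S+\s+){0,10})#books\s*\z/, 1].to_s.strip
else
  # otherwise take everything after #books
  puts text[/#books(.*)/, 1].to_s.strip
end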

That will give you the last 10 words preceding #books if #books is at the end, and whatever comes after #books if it is not.

I don't really have a better idea. Hope that works for you; let me know :)
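
For reuse, the if/else above could be wrapped in a method along these lines. This is a sketch under assumptions, not a definitive implementation: `text_near_tag` and its `tag` and `max_words` parameters are hypothetical names, and the default of 10 words is taken from the question.

```ruby
# Hypothetical helper wrapping the if/else logic.
def text_near_tag(text, tag = "#books", max_words = 10)
  escaped = Regexp.escape(tag)
  if text =~ /#{escaped}\s*\z/
    # Tag is at the end: keep up to max_words words that precede it.
    text[/((?:\S+\s+){0,#{max_words}})#{escaped}\s*\z/, 1].to_s.strip
  else
    # Tag is elsewhere (or missing): keep whatever follows it.
    text[/#{escaped}(.*)/m, 1].to_s.strip
  end
end

puts text_near_tag("This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books")
puts text_near_tag("This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!")
```

Both branches still return surrounding text along with the title, so this only narrows the message down; it does not isolate "War & Peace by Leo Tolstoy" on its own.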

edited Jun 25, 2011 at 7:43

answered Jun 25, 2011 at 0:46

3 Comments

MorningHacker Over a year ago

Nice. This might be good enough to get what I want. Let me give it a shot and I'll get back to you.

MorningHacker Over a year ago

Your suggestion is pretty good. I can live with the garbage text that comes after #books, but nothing before. So I went with "match(/#books.*/)" -- check out the results: rubular.com/r/gM7oSZxF5M. How can I capture Result #6 (i.e., when someone puts #books at the end of the sentence)? Is there some way for me to do an if-then with regex? Something like: "if [#books is at the end of the message], then [take the last 10 words preceding #books], else [match(/#books.*/)]". Please post your solution via a permalink using rubular.com.

rogeliog Over a year ago

Hello, this is the Rubular permalink for getting the last 10 words preceding #books when #books is at the end: link. I also shared the code with you. I don't know if there is a better way to approach this problem; if you do, please let me know :)
