4

I have a string in my DB that represents notes for a user. I want to split this string up so I can separate each note into the content, user, and date.

Here is the format of the String:

"Example Note <i>Josh Test 12:53 PM on 8/14/12</i><br><br> Another example note <i>John Doe 12:00 PM on 9/15/12</i><br><br>  Last Example Note <i>Joe Smoe 1:00 AM on 10/12/12</i><br><br>" 

I need to break this into an array of

["Example Note",  "Josh Test", "12:53 PM 8/14/12", "Another example note", "John Doe", "12:00 PM 9/15/12", "Last Example Note", "Joe Smoe", "1:00 AM 10/12/12"]

I am still experimenting with this. Any ideas are very welcome, thank you! :)

2
  • 1
    That's not the format of the string, it's an example. How much variation is there? Asked another way, what criteria do you use to split? Commented May 31, 2013 at 19:24
  • There is no variation. Each note begins right away, and the content ends with ' <i>'. The name always ends with a space ' ' followed by a number. The time and date are separated with ' on ', and the whole note always ends with '</i><br><br>'. Commented May 31, 2013 at 19:34

3 Answers

3

You could use regex for a simpler approach.

s = "Example Note <i>Josh Test 12:53 PM on 8/14/12</i><br><br> Another example note <i>John Doe 12:00 PM on 9/15/12</i><br><br>  Last Example Note <i>Joe Smoe 1:00 AM on 10/12/12</i><br><br>" 
s.split(/\s+<i>|<\/i><br><br>\s?|(?<!on) (?=\d)/)
=> ["Example Note", "Josh Test", "12:53 PM on 8/14/12", "Another example note", "John Doe", "12:00 PM on 9/15/12", " Last Example Note", "Joe Smoe", "1:00 AM on 10/12/12"]

The date-time elements still contain the word "on", so they do not exactly match the requested format, but it may be acceptable to reformat them in a separate step.

Edit: Removed unnecessary + character.
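
The separate formatting step mentioned above might look something like the following. This is only a sketch: it assumes that note text and names never contain " on " directly followed by a digit, and the variable names `parts`, `cleaned` and `notes` are made up for illustration.

```ruby
s = "Example Note <i>Josh Test 12:53 PM on 8/14/12</i><br><br> Another example note <i>John Doe 12:00 PM on 9/15/12</i><br><br>  Last Example Note <i>Joe Smoe 1:00 AM on 10/12/12</i><br><br>"

# Same split as in the answer.
parts = s.split(/\s+<i>|<\/i><br><br>\s?|(?<!on) (?=\d)/)

# Trim stray whitespace and drop the " on " joiner between time and date.
cleaned = parts.map { |part| part.strip.sub(/ on (?=\d)/, ' ') }

# Optionally group the flat array into [content, user, datetime] triples.
notes = cleaned.each_slice(3).to_a
```

Here `cleaned` is the flat array asked for in the question, and `notes` holds one three-element array per note, which may be easier to iterate over.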


1 Comment

This is what I was looking for in the first place, thank you. I am horrible with regular expressions. Definitely going to have to study up on that.
1

You can use Nokogiri to parse out the required text using XPath/CSS selectors. As a simple example with bare-bones parsing to get you started, the following collects the content of every <i> tag into an array:

require 'nokogiri'

html = Nokogiri::HTML("Example Note <i>Josh Test 12:53 PM on 8/14/12</i><br><br> Another example note <i>John Doe 12:00 PM on 9/15/12</i><br><br>  Last Example Note <i>Joe Smoe 1:00 AM on 10/12/12</i><br><br>")

my_array = html.css('i').map {|text| text.content}
#=> ["Josh Test 12:53 PM on 8/14/12", "John Doe 12:00 PM on 9/15/12", "Joe Smoe 1:00 AM on 10/12/12"]
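
Each string returned this way still combines the user name and the timestamp. A rough sketch of separating them with plain Ruby (no Nokogiri needed for this step) could look like this. It assumes the fixed "name, time, ' on ', date" layout described in the question; `split_meta` and the sample `strings` array are hypothetical names used only for illustration.

```ruby
# Returns [user, "time date"], or nil when the text does not fit the
# assumed layout: name, "H:MM AM/PM", " on ", "M/D/YY".
def split_meta(text)
  pattern = /\A(?<user>.+?) (?<time>\d{1,2}:\d{2} [AP]M) on (?<date>\d{1,2}\/\d{1,2}\/\d{2})\z/
  m = pattern.match(text)
  m && [m[:user], "#{m[:time]} #{m[:date]}"]
end

strings = ["Josh Test 12:53 PM on 8/14/12", "Joe Smoe 1:00 AM on 10/12/12"]
pairs = strings.map { |text| split_meta(text) }
#=> [["Josh Test", "12:53 PM 8/14/12"], ["Joe Smoe", "1:00 AM 10/12/12"]]
```

Returning nil for text that does not match makes malformed rows easy to spot instead of silently producing wrong names.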

With the CSS selector you could just as easily do something like:

require 'nokogiri'

html = Nokogiri::HTML("<h1>My Message</h1><p>Hi today's date is: <time>Friday, May 31st</time></p>")
message_header = html.css('h1').first.content #=> "My Message"
message_body = html.css('p').first.content #=> "Hi today's date is: Friday, May 31st" (content includes nested tags' text)
message_sent_at = html.css('p > time').first.content #=> "Friday, May 31st"

2 Comments

Are you saying the HTML tags should already exist? I am unable to edit the database's data. It will always be the way I showed it, because unfortunately that's how it was saved for hundreds of thousands of users. I'm trying to fix someone's mistake.
@user1977840 That was just an example to get you started. So long as there's some common pattern to the way the HTML data is structured in the database (e.g., date and name data will always be after tag X and before tag Y), you can tailor your Nokogiri selector as needed to select and parse the relevant portions of data. If the HTML isn't well formed, you might be better off using XPath selectors instead.
0

Maybe this could be useful:

require 'date'
require 'time'

text = "Example Note <i>Josh Test 12:53 PM on 8/14/12</i><br><br> Another example note <i>John Doe 12:00 PM on 9/15/12</i><br><br>  Last Example Note <i>Joe Smoe 1:00 AM on 10/12/12</i><br><br>"

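# Each note ends with '<br><br>', so splitting on it yields one chunk per note.
notes = text.split('<br><br>')

pro_notes = []

notes.each do |note_e|
  # notes_temp[0] is the note content; notes_temp[1] is "First Last H:MM AM on M/D/YY</i>".
  notes_temp = note_e.split('<i>')
  words = notes_temp[1].split(' ')

  date = words[5].gsub('</i>', '') # e.g. "8/14/12"

  full_name = words[0] + ' ' + words[1]
  nn = notes_temp[0]
  # Include the AM/PM marker (words[3]) so that afternoon times parse correctly.
  dt = DateTime.strptime("#{date} #{words[2]} #{words[3]}", '%m/%d/%y %I:%M %p')

  pro_notes << [full_name, nn, dt]
end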

1 Comment

Perfect. I added a strip in there to get rid of the whitespace and it worked, thank you! :)
