Regex to find URL parameters in HTML (Ruby)

Question

I am trying to replace embedded YouTube videos with thumbnails in dynamically created email templates. To do that, I need to find the YouTube ID in each embedded URL, then replace the entire block with custom HTML. I have it working when there is only one embedded video, using the following RegEx:

<span contenteditable="false" draggable="true" fr-original-class="fr-video\sfr-dvb\sfr-draggable"\s.*\ssrc="[a-z:]*?\/\/w{3}?.?youtube.com\/embed\/([a-zA-Z\d\-]*).*<\/iframe><\/span>

The problem is, if there is more than one video, it will only find the ID from the last video. I feel like I may be over-complicating this.

Note that the attributes of the span that the embedded video is in will always be the same (contenteditable="false" draggable="true" fr-original-class="fr-video).

A sample email template is below. The above RegEx only pulls the second ID from it, not the first. I would like to pull both.

This is being done in Ruby.

EDIT: I realize the RegEx I am using is probably overkill, but I need a complex RegEx for the gsub replace so that I only replace the video and its container, not anything surrounding it.

<!DOCTYPE html>
<html>
  <head>
    <meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
  </head>
  <body style='margin: 0px; font-family: Helvetica Neue,Helvetica,Arial,sans-serif; font-size: 18px;'>
    <table border='0' cellpadding='0' cellspacing='0' style='font-family: Helvetica Neue,Helvetica,Arial,sans-serif; width: 600px;' width='600'>
      <tr>
        <td>
          FooBar
          <br>
          <br>
          <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable" fr-original-style="-webkit-user-select: none;" style="-webkit-user-select: none; text-align: center; position: relative; display: block; clear: both;">
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
          <br>
          Foo Bar
          <br>
          <br>
          <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable" fr-original-style="-webkit-user-select: none;" style="-webkit-user-select: none; text-align: center; position: relative; display: block; clear: both;">
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;url=http://www.youtube.com/watch?v=skLz87ixE48&amp;image=https://i.ytimg.com/vi/skLz87ixE48/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
          <br>
        </td>
      </tr>
      <tr style='font-family: Helvetica Neue,Helvetica,Arial,sans-serif; font-size: 12px; color: #656565; text-align: center;'>
        <td style='padding: 10px 0px;'>
        </td>
      </tr>
    </table>
  </body>
</html>

So if I understand this correctly, you're trying to do two things with regex? One is to remove the <span>...</span>s containing YouTube embeds, and the second is to capture the IDs of those YouTube embeds?
– wpcarro, Commented Jun 29, 2016 at 19:50
@wcarroll that is correct. Doing the two operations separately is fine. I would like to match the IDs of the embeds and, for each ID I find, replace the YouTube embed and its container with custom HTML I generate. My current RegEx finds the beginning of the first embed (<span>) and matches through to the end of the second embed (</span>), which is obviously not what I want.
– tommybond, Commented Jun 29, 2016 at 19:53
It's strongly recommended you use a parser rather than regular expressions when working with HTML or XML. See stackoverflow.com/questions/1732348/… for a historical discussion. The de facto parser for Ruby is Nokogiri. Nokogiri makes it easy to find particular nodes, extract information, and modify the DOM without using sub or gsub.
– the Tin Man, Commented Jun 29, 2016 at 20:06
@theTinMan that definitely makes sense rather than using gsub. Thanks for this reminder.
– tommybond, Commented Jun 29, 2016 at 20:14
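
The behavior described in the comments above (one match running from the first <span> to the last </span>) is what a greedy .* produces. The following is a minimal sketch of that effect, not the original pattern: the markup and both patterns are simplified stand-ins, and the input is assumed to sit on a single line.

```ruby
# Simplified, single-line stand-in for the template (attributes shortened for illustration).
html = '<span class="fr-video"><iframe src="//www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed"></iframe></span>' \
       '<br>' \
       '<span class="fr-video"><iframe src="//www.youtube.com/embed/skLz87ixE48?feature=oembed"></iframe></span>'

# Greedy: the first .* runs to the end of the input and backtracks to the LAST src=,
# so one match swallows both embeds and only the last ID is captured.
greedy = %r{<span class="fr-video".*src="[^"]*/embed/([\w-]+).*</iframe></span>}

# Lazy: .*? stops at the first position where the rest of the pattern can match,
# so each embed is matched on its own.
lazy = %r{<span class="fr-video".*?src="[^"]*/embed/([\w-]+).*?</iframe></span>}

greedy_ids = html.scan(greedy).flatten
lazy_ids   = html.scan(lazy).flatten

p greedy_ids # => ["skLz87ixE48"]
p lazy_ids   # => ["e7zCqsjK1Vg", "skLz87ixE48"]
```

Switching the quantifiers to lazy explains the symptom, but the pattern remains sensitive to small changes in the markup.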

the Tin Man · Accepted Answer · 2016-06-29 20:34:29Z

Don't use regular expressions for this. There are existing tools to make it much easier:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html>
<html>
  <body>
    <table>
      <tr>
        <td>
          <span>
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
          <span>
            <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;url=http://www.youtube.com/watch?v=skLz87ixE48&amp;image=https://i.ytimg.com/vi/skLz87ixE48/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
          </span>
        </td>
      </tr>
    </table>
  </body>
</html>
EOT

At this point it's easy to search for the <span> tags. Here's the first one:

doc.search('span').first.to_html
# => "<span>\n            <iframe src=\"//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>"

last or regular array indexing could be used to find specific instances if necessary.

Instead of using search and first, we can use at, which does both internally:

doc.at('span').to_html
# => "<span>\n            <iframe src=\"//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>"

We can dig into a node to grab its attributes:

doc.at('iframe')['src']
# => "//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&key=2aa3c4d5f3de4f5b9120b660ad850dc9&type=text/html&schema=youtube"

Once you have a URL, there are tools for manipulating it too:

require 'uri'
iframe = doc.at('iframe')
uri = URI.parse('http:' + iframe['src'])

We can extract the query:

uri.query # => "src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&key=2aa3c4d5f3de4f5b9120b660ad850dc9&type=text/html&schema=youtube"

We can parse it into a hash, making it easy to pick it apart:

URI::decode_www_form(uri.query).to_h['src']
# => "https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed"
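
From that decoded src value, the video ID can be taken from the URL path rather than matched with a pattern. This is a small sketch that assumes the embed URL always has the form /embed/<id>, as it does in the sample; the variable names are illustrative.

```ruby
require 'uri'

# The decoded "src" query value from the sample iframe.
embed_url = 'https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed'

embed_uri = URI.parse(embed_url)

# The path is "/embed/e7zCqsjK1Vg"; the ID is assumed to be its last segment.
video_id = embed_uri.path.split('/').last

puts video_id # prints e7zCqsjK1Vg
```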

... or modify it:

query = URI::decode_www_form(uri.query).to_h
query['src'] = 'http://example.com'

uri.query = URI::encode_www_form(query)

uri.to_s
# => "http://cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fexample.com&url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3De7zCqsjK1Vg&image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fe7zCqsjK1Vg%2Fhqdefault.jpg&key=2aa3c4d5f3de4f5b9120b660ad850dc9&type=text%2Fhtml&schema=youtube"

Once you're there, it's easy to modify the HTML if necessary:

iframe['src'] = uri.to_s
iframe.to_html
# => "<iframe src=\"http://cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fexample.com&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3De7zCqsjK1Vg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fe7zCqsjK1Vg%2Fhqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>"

and:

doc.to_html
# => "<!DOCTYPE html>\n<html>\n  <body>\n    <table>\n      <tr>\n        <td>\n          <span>\n            <iframe src=\"http://cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fexample.com&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3De7zCqsjK1Vg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fe7zCqsjK1Vg%2Fhqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>\n          <span>\n            <iframe src=\"//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;url=http://www.youtube.com/watch?v=skLz87ixE48&amp;image=https://i.ytimg.com/vi/skLz87ixE48/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube\" width=\"600\" height=\"338\" scrolling=\"no\" frameborder=\"0\" allowfullscreen=\"\" style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-style=\"box-sizing: content-box; max-width: 100%; border: 0px;\" fr-original-class=\"embedly-embed\"></iframe>\n          </span>\n        </td>\n      </tr>\n    </table>\n  </body>\n</html>\n"

This isn't exactly an example of how to solve the problem you're asking about. Instead, it's a reminder that there are existing, well-tested wheels based on the specs, and we should use them.

I may have to use a mashup of both methods, I only want to pull <span> nodes that have embedded YouTube videos within them.
No, it's possible to do without complex regex, using Nokogiri and URI. Read about CSS selectors and how to search inside parameters, or learn about XPath. Those have been discussed many times here on SO, and on the internet.
Okay, you were definitely correct. Just got this working really elegantly and simply using Nokogiri. Thanks a lot!
I'm glad it helped. The benefits for using a parser don't really kick in until you've written several scrapers or spiders and see how easy it is to root around in the DOM, or you're parsing XML or manipulating it. Regex break so easily, especially with tiny changes to the HTML or XML, and having to support a fragile solution is enough to make anyone scream.

wpcarro · Accepted Answer · 2016-06-29 20:17:43Z

To grab the YouTube IDs, I think the best way would be to use look-arounds. The following should work.

(?<=embed\/)(.+?)(?=\?)

Here's a link to a demonstration on regex101.com

Turn on the "global" flag so that the regex engine doesn't stop after finding the first match. This regex uses a look-behind, (?<=embed\/); followed by a capturing group that matches wildcard characters in a non-greedy fashion, (.+?); followed by a look-ahead that asserts a literal question mark, (?=\?).

This should suffice in grabbing the video IDs.
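
Ruby has no "global" flag; String#scan returns every match instead. A minimal sketch of applying the pattern above in Ruby, using a shortened two-iframe sample (the html variable is illustrative):

```ruby
# Shortened sample: one iframe per line, attributes other than src omitted.
html = <<~HTML
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed"></iframe>
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed"></iframe>
HTML

# scan collects all matches; with one capture group it returns an array of
# one-element arrays, so flatten yields a plain list of IDs.
ids = html.scan(/(?<=embed\/)(.+?)(?=\?)/).flatten

p ids # => ["e7zCqsjK1Vg", "skLz87ixE48"]
```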

As for replacing the HTML, here's a regex that will match the <span>...</span> blocks:

<span.*?>\s*<iframe.+?>.*?<\/iframe>\s*<\/span>

For this to work, apply the s flag to the regex engine so that . wildcard characters can match \n newline characters. Also apply the g flag for the same reason mentioned previously.

NOTE: this will capture any <span> groups that have <iframe>s as direct children. Depending on the content with which you are working, you may need to add more specificity to the regex to scan the attributes on those <iframe>s. For the content you provided to this question, however, it appears to work.
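
In Ruby specifically, the equivalent of the s flag is the /m modifier, and gsub already replaces every match, so no g flag is needed. A sketch of the replacement under those assumptions follows; the sample markup is shortened, and the thumbnail markup is a hypothetical placeholder built from the i.ytimg.com URL pattern that appears in the question's sample.

```ruby
html = <<~HTML
  <p>FooBar</p>
  <span contenteditable="false" draggable="true">
    <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed"></iframe>
  </span>
  <p>Foo Bar</p>
  <span contenteditable="false" draggable="true">
    <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed"></iframe>
  </span>
HTML

# /m lets . match newlines in Ruby (what other engines call the s flag).
block = %r{<span.*?>\s*<iframe.+?>.*?</iframe>\s*</span>}m

# gsub with a block replaces every match; the ID is pulled out of each matched block.
replaced = html.gsub(block) do |match|
  id = match[%r{(?<=embed/)[\w-]+}]
  # Hypothetical replacement markup.
  %(<img src="https://i.ytimg.com/vi/#{id}/hqdefault.jpg" alt="Video thumbnail">)
end

puts replaced
```

As the note above says, this pattern will also match any other <span> that directly wraps an <iframe>.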

Let me know if you'd like any clarification or additional functionality.

Here's a link to a demonstration on regex101.com.

edited Jun 29, 2016 at 20:17

answered Jun 29, 2016 at 20:01

4 Comments

tommybond Over a year ago

Fantastic, thank you so much for this. The first regex seems to work wonderfully for my purpose, though the second one doesn't seem to work with the example I've posted. I did alter it to <span.+?>\s*<iframe.+?>.*?<\/iframe>\s*<\/span> to account for the <span> attributes, but it still does not seem to be working.

wpcarro Over a year ago

Let me take another look.

wpcarro Over a year ago

How about this? <span.*?>\s*<iframe.+?>.*?<\/iframe>\s*<\/span> I'll edit my answer if this works for you. Make sure the flags are set to g and s. This is working here. regex101.com/r/nF0bQ6/1 Do you have additional content for which this fails?

wpcarro Over a year ago

Great. I'll edit my response and then can you mark it as correct?
