Parsing link from javascript function

Question

I'm trying to parse a direct link out of a JavaScript function within a page. I can already parse the HTML I need, but I'm stumped on the JavaScript part. Is this achievable with PHP, possibly with a regex?

function videoPoster() {
    document.getElementById("html5_vid").innerHTML = 
        "<video x-webkit-airplay='allow' id='html5_video' style='margin-top:" 
        + style_padding 
        + "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' " 
        + "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}

What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!

There is actually a session ID that trails after the .mp4, but of course it will change every time the page is reloaded.
– user1941752, Commented Jan 3, 2013 at 2:08
@user1941752 If you can identify the URL by the first occurrences of http:// and .mp4, that's what a regex excels at.
– John Dvorak, Commented Jan 3, 2013 at 2:19

jbabey · Accepted Answer · 2013-01-03 02:27:52Z

2

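/http:\/\/.*?\.mp4/ will give you all characters between http:// and the first .mp4, inclusive. The slashes are escaped because / is also the regex delimiter, and .*? is lazy, so the match stops at the first .mp4 instead of running on to the last one.

If you need the session id, use something like /http:\/\/.*?\.mp4\?sessionid=\d+/ (the ? has to be escaped, otherwise it just makes the preceding 4 optional).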
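
A short JavaScript sketch of how such a pattern behaves. Inside a /.../ literal the slashes of http:// must be escaped, a greedy .* runs to the last .mp4 in the input, a lazy .*? stops at the first, and the ? before the session parameter needs a backslash. The sample string and the sessionid parameter name are assumptions for illustration; the real page may name the parameter differently.

```javascript
// Hypothetical sample: the URL from the question plus an assumed session
// parameter, followed by a second .mp4 later in the same string.
const source =
  "<source src='http://video-website.com/videos/videoname.mp4?sessionid=12345' type='video/mp4'>" +
  "<a href='/other/clip.mp4'>";

// Greedy: .* runs to the LAST ".mp4" in the string.
const greedy = source.match(/http:\/\/.*\.mp4/)[0];

// Lazy: .*? stops at the FIRST ".mp4".
const lazy = source.match(/http:\/\/.*?\.mp4/)[0];

// Session variant: \? matches a literal question mark.
const withSession = source.match(/http:\/\/.*?\.mp4\?sessionid=\d+/)[0];

console.log(greedy.endsWith("clip.mp4")); // true: the greedy match overshoots
console.log(lazy);        // http://video-website.com/videos/videoname.mp4
console.log(withSession); // http://video-website.com/videos/videoname.mp4?sessionid=12345
```

On a page that contains only one .mp4 the greedy and lazy forms return the same thing; the lazy form is simply the safer default.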

dspeyer · Accepted Answer · 2013-01-03 02:30:48Z

2

In general, no. Nothing short of a full JavaScript parser will always extract URLs, and even then you'll have trouble with URLs that are computed nontrivially.

In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:

['"](http://[^'"]*)['"]

If you have to enter that regexp as a string, beware of escaping.

If you ever have unescaped quotation marks in URLs, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the URLs in JavaScript a pain.
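
As a rough illustration of the escaping caveat, here is the same pattern written once as a JavaScript regex literal and once as a string. The variable names and the sample input (the last line of the function from the question) are chosen for this sketch only.

```javascript
// Regex literal: the forward slashes must be escaped because / is the delimiter.
const literal = /['"](http:\/\/[^'"]*)['"]/;

// Same pattern built from a double-quoted string: there is no / delimiter,
// so the slashes are left alone, but every " inside the pattern needs a backslash.
const fromString = new RegExp("['\"](http://[^'\"]*)['\"]");

// Sample input: the last line of the function from the question.
const js = "+ \"<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>\";";

const m1 = js.match(literal);
const m2 = js.match(fromString);

// Group 1 holds the URL without the surrounding quotes.
console.log(m1 ? m1[1] : null); // http://video-website.com/videos/videoname.mp4
console.log(m2 ? m2[1] : null); // http://video-website.com/videos/videoname.mp4
```

Which characters need escaping depends on how the pattern is written down (literal or string, single or double quotes), not on the pattern itself.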

PleaseStand · Accepted Answer · 2013-01-03 02:49:15Z

0

For your specific case, this should work, provided that none of the characters in the URL are escaped.

preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];

See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).

(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
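
The error handling described above could look roughly like this. The sketch is in JavaScript rather than PHP, and extractVideoUrl is a hypothetical helper name; the checks are the ones suggested in this answer (the regex matched, the URL starts with http://, and it contains .mp4).

```javascript
// Returns the video URL, or null when nothing usable is found.
// extractVideoUrl is a hypothetical helper name, not part of any library.
function extractVideoUrl(html) {
  const match = /src='([^']*)'/.exec(html);
  if (match === null) {
    return null; // the regex did not match at all
  }
  const url = match[1];
  // Extra sanity checks: expected scheme and file extension.
  if (!url.startsWith("http://") || !url.includes(".mp4")) {
    return null;
  }
  return url;
}

// Prints the URL from the question.
console.log(extractVideoUrl("<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>"));
// Prints null: a src attribute was found, but it is not an http:// .mp4 link.
console.log(extractVideoUrl("<img src='/logo.png'>"));
```

Returning null for every failure keeps the caller simple; a stricter version might report which check failed, so that a change on the scraped site is easier to diagnose.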

Anubis · Accepted Answer · 2013-01-03 11:35:06Z

0

The following captures every http:// or https:// URL that appears in a src attribute of your HTML:

$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
    print_r($matches['urls']);
}

If you want to do the same in JavaScript, you could use this:

var re = /src=["'](https?:\/\/[^"']+)["']/g;
var urls = [];
var m;
// With the g flag, html.match(re) would return the whole matches, still
// including the src=" and " parts. Calling exec() in a loop keeps group 1.
while ((m = re.exec(html)) !== null) {
    urls.push(m[1]);
}
