
I'm trying to parse a direct link out of a JavaScript function within a page. I'm able to parse the HTML I need, but am stumped on the JavaScript part. Is this achievable with PHP, possibly using a regex?

function videoPoster() {
    document.getElementById("html5_vid").innerHTML = 
        "<video x-webkit-airplay='allow' id='html5_video' style='margin-top:" 
        + style_padding 
        + "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' " 
        + "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}

What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!

  • Does it always start with http:// and end with .mp4? Commented Jan 3, 2013 at 1:57
  • There is actually a session ID that trails after the .mp4, but of course it will change every time the page is reloaded. Commented Jan 3, 2013 at 2:08
  • @user1941752 If you can identify the URL by the first occurrences of http:// and .mp4, that's what a regex excels at. Commented Jan 3, 2013 at 2:19
  • @user1941752 Were any of the answers helpful? Commented Nov 12, 2013 at 4:54

4 Answers


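/http:\/\/.*?\.mp4/ will give you all characters from http:// through the first .mp4 that follows, inclusive. The slashes inside the pattern are escaped because / is also the delimiter, and the lazy .*? keeps the match from running on to a later .mp4 on the same line.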


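If you need the session ID, use something like /http:\/\/.*?\.mp4\?sessionid=\d+/ (the ? must be escaped, otherwise it makes the preceding 4 optional; replace sessionid with the actual parameter name).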
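
A minimal PHP sketch of this approach, not a definitive implementation. It assumes the page source is already in a string; the `$js` sample and its `?sessionid=12345` suffix are made up for illustration, since the real parameter name is not shown in the question. Using `#` as the delimiter avoids having to escape the slashes in `http://`.

```php
<?php
// Hypothetical input: the question's <source> tag with an assumed session-id suffix.
$js = "<source src='http://video-website.com/videos/videoname.mp4?sessionid=12345' type='video/mp4'>";

// '#' is the delimiter, so the slashes in http:// need no escaping.
// .*? is lazy and stops at the first ".mp4".
// (?:\?sessionid=\d+)? optionally includes the assumed session-id query string.
$pattern = '#http://.*?\.mp4(?:\?sessionid=\d+)?#';

$url = null;
if (preg_match($pattern, $js, $m) === 1) {
    $url = $m[0]; // the whole match is the URL
}

echo $url, PHP_EOL;
```

If the session ID is not purely numeric, the `\d+` part would need to be loosened, for example to `[^'"]+`.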


In general, no. Nothing short of a full JavaScript parser will always extract URLs, and even then you'll have trouble with URLs that are computed nontrivially.

In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:

['"](http://[^'"]*)['"]

If you have to enter that regexp as a string, beware of escaping.

If you ever have unescaped quotation marks in URLs, this will fail. Such URLs are valid but rare. Whoever is writing the code you're parsing is unlikely to use them, because they make referring to the URLs in JavaScript a pain.
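
As a rough illustration of the escaping caveat, here is how that regexp might be written as a PHP string. This is a sketch under assumptions: the `$js` sample is copied from the question (without the session ID), and `~` is chosen as the delimiter purely for convenience.

```php
<?php
// Sample input taken from the question's snippet (session id omitted).
$js = <<<'JS'
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
JS;

// Inside a single-quoted PHP string, the single quote must be written as \'.
// '~' is the delimiter, so the slashes in http:// stay unescaped.
$pattern = '~[\'"](http://[^\'"]*)[\'"]~';

$urls = [];
if (preg_match_all($pattern, $js, $m) > 0) {
    $urls = $m[1]; // group 1 holds the URLs without the surrounding quotes
}

print_r($urls);
```

Because the pattern only requires a quote followed by `http://`, it picks up every quoted URL in the input, not just the one in the `src` attribute.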


For your specific case, this should work, provided that none of the characters in the URL are escaped.

preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];

See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
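
The error handling described above could be sketched roughly as follows. The function name `extractVideoUrl` is hypothetical, the nullable return type assumes PHP 7.1 or later, and the check looks only for `.mp4` rather than a trailing query string.

```php
<?php
// Hypothetical helper: returns the src URL, or null if nothing usable is found.
function extractVideoUrl(string $html): ?string
{
    // preg_match returns 1 on a match, 0 on no match, and false on error.
    if (preg_match("/src='([^']*)'/", $html, $matches) !== 1) {
        return null;
    }
    $url = $matches[1];

    // Extra sanity checks: the URL must start with http:// and contain .mp4.
    if (strpos($url, 'http://') !== 0 || strpos($url, '.mp4') === false) {
        return null;
    }
    return $url;
}

// Usage with the tag from the question.
$html = "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
var_dump(extractVideoUrl($html));
```

Returning null for every failure keeps the caller's check simple; a real script might prefer to distinguish "no match" from "unexpected URL" when logging.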

(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)


The following captures every http or https URL that appears in a src attribute in your HTML:

$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
    print_r($matches['urls']);
}

If you want to do the same in JavaScript, you could use this:

var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
    // Gives you all matches, but each one still includes the src=" and " parts,
    // so you would have to run every match against the regex again without the g modifier.
}
