0

I can scrape a page for URLs, but what is the easiest way to convert the various relative forms these links can take into fully qualified URLs? For example:

If I scrape: www.mysite.com/some/place/in/space.html

And I get the following URLs:

../img.jpg
img.jpg
../../bla.jpg
inc/bla.jpg
/
./

They should resolve to:

www.mysite.com/some/place/img.jpg
www.mysite.com/some/place/in/img.jpg
www.mysite.com/some/bla.jpg
www.mysite.com/some/place/in/inc/bla.jpg
www.mysite.com/
www.mysite.com/some/place/in/

Is there a function that does this for all cases or is it something I would have to code?

3 Answers

1

I use this function from a crawler I wrote a long time ago: http://codepad.org/1VxMECNj

Call the function with the host and base directory prepended to the relative URL:

relativeUrl('http://host/dir/dir2/../../file.html');
//> returns http://host/file.html
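
Since the codepad link may not stay available, here is a minimal sketch of the same idea. It is not the linked function: resolveUrl() is a hypothetical helper that follows a simplified version of the reference-resolution rules in RFC 3986 section 5.2, and it assumes the base URL carries a scheme (http:// is added to the question's example for that reason). Note that under these rules a link of / resolves to the site root, not to the current directory.

```php
<?php
// Hypothetical helper (not the codepad function): resolve $relative against $base.
// Simplified from RFC 3986 section 5.2; assumes $base is an absolute URL with a host.
function resolveUrl(string $base, string $relative): string
{
    // Already absolute (has a scheme such as http: or mailto:): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $parts = parse_url($base);

    // Protocol-relative link (//host/path): borrow the scheme of the base URL.
    if (strncmp($relative, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $relative;
    }

    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    // Split off ?query / #fragment so dot segments are only removed from the path.
    $cut     = strcspn($relative, '?#');
    $relPath = (string) substr($relative, 0, $cut);
    $suffix  = (string) substr($relative, $cut);

    if ($relPath === '') {
        // Empty path: stay on the base document, keeping its query if none is given.
        $path = $basePath;
        if (($suffix === '' || $suffix[0] === '#') && isset($parts['query'])) {
            $suffix = '?' . $parts['query'] . $suffix;
        }
    } elseif ($relPath[0] === '/') {
        // Root-relative: replaces the whole base path.
        $path = $relPath;
    } else {
        // Document-relative: append to the directory of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }

    // Collapse "." and ".." segments.
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '.') {
            continue;
        }
        if ($seg === '..') {
            // Never pop the leading empty segment that stands for the root "/".
            if (count($out) > 1) {
                array_pop($out);
            }
            continue;
        }
        $out[] = $seg;
    }
    // A path ending in "." or ".." names a directory, so keep a trailing slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }

    return $root . implode('/', $out) . $suffix;
}

// Usage with the links from the question.
$base = 'http://www.mysite.com/some/place/in/space.html';
foreach (['../img.jpg', 'img.jpg', '../../bla.jpg', 'inc/bla.jpg', '/', './'] as $link) {
    echo $link, ' => ', resolveUrl($base, $link), "\n";
}
```

The sketch does not handle user:password@ credentials in the base URL or other edge cases of the RFC, so treat it as a starting point rather than a complete resolver.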

0

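You can just prepend www.mysite.com/some/place/in/ to each URL. For example, www.mysite.com/some/place/in/../img.jpg should still resolve, I think. Note that this does not work for root-relative links such as /, which have to be joined to the host alone.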


0

You could use a regex to prefix the relative links with the site's base URL:

$data = preg_replace('#(href|src)="([^:"]*)("|(?:(?:%20|\s|\+)[^"]*"))#', '$1="' . $site_url . '$2$3', $data);
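
A plain prefix leaves ../ segments in place and gives the wrong result for root-relative links, so one possible variation is to pass each attribute value through a function instead. The sketch below uses preg_replace_callback for that. rewriteLinks() and the $prefix closure are hypothetical names, the pattern only covers double-quoted href and src attributes, and HTML entity handling is left out.

```php
<?php
// Hypothetical helper: rewrite every double-quoted href/src value in $html by
// passing it through $resolve (relative URL in, absolute URL out).
function rewriteLinks(string $html, callable $resolve): string
{
    $result = preg_replace_callback(
        '#\b(href|src)="([^"]*)"#i',
        function (array $m) use ($resolve): string {
            // $m[1] is the attribute name, $m[2] the raw URL between the quotes.
            return $m[1] . '="' . $resolve($m[2]) . '"';
        },
        $html
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Usage: a deliberately simple resolver that prefixes anything without a scheme.
// Any function with the same shape, including a full URL resolver, can be used.
$siteUrl = 'http://www.mysite.com/some/place/in/';
$prefix = function (string $url) use ($siteUrl): string {
    return preg_match('#^[a-z][a-z0-9+.-]*:#i', $url) ? $url : $siteUrl . $url;
};

echo rewriteLinks('<a href="inc/bla.jpg">x</a> <img src="http://cdn.example/i.png">', $prefix), "\n";
```

Keeping the matching and the resolving separate means the regex stays small and the URL logic can be tested on its own.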
