2

I'm making an app that parses html and gets images from it. Parsing is easy using Beautiful Soup and downloading of the html and the images works too with urllib2.

I do have a problem with urlparse to make absolute paths out of relative ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't take away the ../ away. This gives a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

1

4 Answers 4

3

".." would bring you up one directory ("." is current directory), so combining that with a domain name url doesn't make much sense. Maybe what you need is:

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
Sign up to request clarification or add additional context in comments.

2 Comments

While this is a solution, this won't work in my case: my application has to be able to retrieve images from any websites.I can't just replace "../" by "./" because this would break for other sites where it is actually supposed to go look at the parent directory.
urlparse.urljoin("example.com/dir/","../test.png") works for me ( I get 'example.com/test.png'). I guess it's just that ".." doesn't mean anything in the context you have (what is one directory up the base one). At least I don't think it does.
2

I think the best you can do is to pre-parse the original URL, and check the path component. A simple test is

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the indexing suggested by demas. For example:

start_offset = (len(urlparse.urlparse(baseurl).path) <= 1) and 2 or 0
img_url = urlparse.urljoin("http://www.example.com/", "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.

1 Comment

Thanks, I'll go this route and implement something like that.
1

If you'd like that /../test would mean the same as /test like paths in a file system then you could use normpath():

>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query,p.fragment))
'http://example.com/test'

Comments

0
urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

It is what you need?

1 Comment

This has the same problem as Dasuraga's solution: it would only work for that certain website, while breaking others.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.