Python urlparse: small issue

Question

I'm making an app that parses HTML and downloads the images it references. Parsing is easy with Beautiful Soup, and downloading the HTML and the images with urllib2 works too.

However, I have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't remove the ../ from the result. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

A relative href="../test.png" works but not href="example.com/../test.png"? – Paulo Scardine, Nov 6, 2010 at 17:46

rtpg · Accepted Answer · 2010-11-06 17:30:10Z

3

".." takes you up one directory ("." is the current directory), so combining it with a URL that has nothing after the domain name doesn't make much sense. Maybe what you need is:

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
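
To illustrate how urljoin treats "." and ".." when the base URL does have directories to climb, here is a small sketch. The base URL with /dir/sub/ is made up for illustration, and the import fallback is an assumption so that the snippet runs on Python 3 (where the module is urllib.parse) as well as on Python 2 (urlparse, as in the question).

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the question

# Hypothetical page URL, two directories below the site root.
base = "http://www.example.com/dir/sub/page.html"

same_dir = urljoin(base, "./test.png")     # stays in /dir/sub/
parent_dir = urljoin(base, "../test.png")  # climbs one level, to /dir/
from_root = urljoin(base, "/test.png")     # absolute path, ignores /dir/sub/

print(same_dir)
print(parent_dir)
print(from_root)
```

With a base like this, ".." has a parent directory to resolve against. The case in the question is different because the base path is already the root.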

2 Comments

Mew Over a year ago

While this is a solution, it won't work in my case: my application has to be able to retrieve images from any website. I can't just replace "../" with "./", because that would break other sites where the URL really is supposed to point to the parent directory.

rtpg Over a year ago

urlparse.urljoin("example.com/dir/", "../test.png") works for me (I get 'example.com/test.png'). I guess it's just that ".." doesn't mean anything in your context (what is one directory up from the base one?). At least I don't think it does.

vhallac · Accepted Answer · 2010-11-06 17:55:00Z

2

I think the best you can do is to pre-parse the original URL and check its path component. A simple test is:

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the indexing suggested by demas. For example:

baseurl = "http://www.example.com/"
# Skip the leading ".." when the base URL has no parent directory to go up to.
start_offset = 0 if len(urlparse.urlparse(baseurl).path) > 1 else 2
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.
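
The idea above could be wrapped in a helper along these lines. This is only a sketch: the function name join_from_root_safe is hypothetical, the import fallback is an assumption so it runs on both Python 2 and Python 3, and unlike the slicing above it strips every leading "../" rather than just the first two characters.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2, as used in the question


def join_from_root_safe(base_url, relative_url):
    """Join URLs, dropping leading "../" segments when the base is at the site root."""
    base_path = urlparse(base_url).path
    if len(base_path) <= 1:  # "" or "/": there is no parent directory to go up to
        while relative_url.startswith("../"):
            relative_url = relative_url[3:]
    return urljoin(base_url, relative_url)


print(join_from_root_safe("http://www.example.com/", "../test.png"))
print(join_from_root_safe("http://www.example.com/dir/", "../test.png"))
```

Note that this only covers a base URL that sits at the root. A reference such as "../../test.png" against a base of /dir/ still climbs above the root, so it would need the path to be normalized after joining instead.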

1 Comment

Mew Over a year ago

Thanks, I'll go this route and implement something like that.

jfs · Accepted Answer · 2010-11-07 20:06:51Z

1

If you'd like /../test to mean the same as /test, as it does for file system paths, you could use posixpath.normpath():

>>> import posixpath
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
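
For use on arbitrary sites, the same steps might be packaged as a function. This is a sketch under a few assumptions: the name join_and_normalize is hypothetical, the import fallback is there so it runs on Python 2 and Python 3, and the trailing-slash handling is an addition, since normpath() removes a trailing slash that can be significant in a URL. On Python 3.5 and later, urljoin follows RFC 3986 and already drops the extra ".." itself, so the normalization step mainly matters on older versions.

```python
import posixpath

try:
    from urllib.parse import urljoin, urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse, urlunparse  # Python 2


def join_and_normalize(base_url, relative_url):
    """Join two URLs, then collapse "." and ".." segments in the path."""
    p = urlparse(urljoin(base_url, relative_url))
    path = p.path
    if path:  # normpath("") would return ".", so leave an empty path alone
        normalized = posixpath.normpath(path)
        # normpath drops a trailing slash; put it back if the path had one.
        if path.endswith("/") and not normalized.endswith("/"):
            normalized += "/"
        path = normalized
    return urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))


print(join_and_normalize("http://www.example.com/", "../test.png"))
print(join_and_normalize("http://www.example.com/a/b/", "../c/"))
```

Because the path is normalized after joining, a reference that legitimately points to a parent directory still resolves there, and only the segments that would climb above the root are discarded.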

ceth · Accepted Answer · 2010-11-06 17:31:30Z

0

urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

Is this what you need?

1 Comment

Mew Over a year ago

This has the same problem as Dasuraga's solution: it would only work for that certain website, while breaking others.
