I have the following code (gdaten[n][2] holds a URL; n is the index):

    try:
        p=urlparse(gdaten[n][2])
        while p.scheme == "javascript" or p.scheme == "mailto":
            p=urlparse(gdaten[n][2])
            print(p," was skipped (", gdaten[n][2],")")
            n += 1
        print ("check:", gdaten[n][2])
        f = urllib.request.urlopen(gdaten[n][2])
        htmlcode = str(f.read())
        parser = MyHTMLParser(strict=False)
        parser.feed(htmlcode)

    except urllib.error.URLError:
        pass  # do some stuff
    except IndexError:
        pass  # do some stuff
    except ValueError:
        pass  # do some stuff

Now I have the following error:

urllib.error.URLError: <urlopen error unknown url type: javascript>

in line 8 (the urllib.request.urlopen call). How is that possible? I thought the while loop would skip all links with the javascript scheme. Why doesn't the except catch it? Where is my mistake? MyHTMLParser appends the links found on the website to gdaten, like this: [[stuff, stuff, link], [stuff, stuff, link]]

  • That is not the real indentation of your code right? Also, can you show us a little more of your code? Commented Oct 24, 2013 at 18:33

1 Answer

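This is an off-by-one error.

In other words, n and p are out of sync: inside the loop, p is parsed from gdaten[n][2] and only then is n incremented, so n always ends up one entry ahead of the URL that p describes.

To fix this, increment n before setting p, so that the loop condition always checks the same entry that gets opened afterwards.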

Why wasn't this working?

Assuming n is set to zero at the start (could start at 42, it doesn't matter), let's say gdaten is laid out like so:

gdaten[0][2] = "javascript://blah.js"
gdaten[1][2] = "http://hello.com"
gdaten[2][2] = "javascript://moo.js"

On the first check of the while condition, p.scheme is 'javascript', so we enter the loop. p gets set to urlparse("javascript://blah.js") again (n is still 0) and n is increased to 1. The condition is then evaluated against urlparse("javascript://blah.js") a second time, so we go through the loop again.

p now gets set to urlparse("http://hello.com") and n gets set to 2.

Since urlparse("http://hello.com") passes the check, the while loop ends.

Meanwhile, since n is now 2, the URL that gets opened is gdaten[2][2], which is "javascript://moo.js" and was never checked.
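
The walk-through above can be replayed with a small sketch. The first two fields of each row and the function name buggy_skip are placeholders made up for illustration; the loop body is the one from the question, minus the print call.

```python
from urllib.parse import urlparse

# Sample data mirroring the layout above; only index 2 of each row matters here.
gdaten = [
    [None, None, "javascript://blah.js"],
    [None, None, "http://hello.com"],
    [None, None, "javascript://moo.js"],
]

def buggy_skip(gdaten, n):
    """Replay the question's loop and return the index it stops on."""
    p = urlparse(gdaten[n][2])
    while p.scheme == "javascript" or p.scheme == "mailto":
        p = urlparse(gdaten[n][2])  # parses the entry at the *old* n
        n += 1                      # n now points one entry past p
    return n

stopped_at = buggy_skip(gdaten, 0)
print(stopped_at, gdaten[stopped_at][2])
```

The loop stops because p describes "http://hello.com", but the returned index already points at the unchecked javascript link that follows it.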

Code fix

    try:
        p = urlparse(gdaten[n][2])
        # Skip entries that urlopen cannot fetch: javascript:, mailto:, or no scheme at all
        while p.scheme == "javascript" or p.scheme == "mailto" or not p.scheme:
            print(p, " was skipped (", gdaten[n][2], ")")

            # Skipping to the next value
            n += 1
            p = urlparse(gdaten[n][2])

        print("check:", gdaten[n][2])
        f = urllib.request.urlopen(gdaten[n][2])
        htmlcode = str(f.read())

    ...
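
If it helps, the skipping logic can also be pulled into a small helper so that the index and the parsed URL cannot drift apart. This is only a sketch: next_fetchable_index, SKIPPED_SCHEMES and the links sample are hypothetical names, not part of the question's code, and the set of skipped schemes is an assumption based on the two mentioned in the question.

```python
from urllib.parse import urlparse

# Assumed set of schemes to skip, taken from the question.
SKIPPED_SCHEMES = ("javascript", "mailto")

def next_fetchable_index(gdaten, n):
    """Return the first index >= n whose URL has a scheme worth fetching.

    Raises IndexError when no such entry is left, which the question's
    `except IndexError` branch would catch.
    """
    while True:
        p = urlparse(gdaten[n][2])  # p is always parsed from the current n
        if p.scheme and p.scheme not in SKIPPED_SCHEMES:
            return n
        n += 1

# Hypothetical sample rows in the [stuff, stuff, link] layout.
links = [
    [None, None, "javascript://blah.js"],
    [None, None, "http://hello.com"],
    [None, None, "javascript://moo.js"],
]
print(next_fetchable_index(links, 0))
```

Inside the try block this would be used as n = next_fetchable_index(gdaten, n) right before the urlopen call.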
1 Comment

Tried exactly that 10 secs before you answered
