0

I'm writing a simple python script so I can test my websites from a different ip address.

The url of a page is given in the querystring, the script fetches the page and displays it to the user. The code below is used to rewrite the tags that contain urls but I don't think it's complete/totally correct.

def rel2abs(rel_url, base=loc):
    return urlparse.urljoin(base, rel_url)

def is_proxy_else_abs(tag, attr):
    if tag in ('a',):
        return True
    if tag in ('form', 'img', 'link') and attr in ('href', 'src', 'action', 'background'):
        return False

def repl(matchobj):
    if is_proxy_else_abs(matchobj.group(1).lower(), matchobj.group(3).lower()):
        return r'<%s %s %s="http://%s?%s" ' %(proxy_script_url, matchobj.group(1), matchobj.group(2), matchobj.group(3), urllib.urlencode({'loc':rel2abs(matchobj.group(5))}))
    else:
        return r'<%s %s %s="%s" ' %(matchobj.group(1), matchobj.group(2), matchobj.group(3), rel2abs(matchobj.group(5)))

def fix_urls(page):
    get_link_re = re.compile(r"""<(a|form|img|link) ([^>]*?)(href|src|action|background)\s*=\s*("|'?)([^>]*?)\4""", re.I|re.DOTALL)
    page = get_link_re.sub(repl, page)
    return page

The idea is that 'a' tag's href attributes should be routed through the proxy script, but css, javascript, images, forms etc should not be, so these have to be made absolute if they are relative in the original page.

The problem is the code doesn't always work, css can be written in a number of ways etc. Is there a more comprehensive regex I can use?

1
  • Probably a silly question, but have you consider simply writing a REAL http proxy? With a real proxy you shouldn't have to rewrite anything since your browser will be explicitly configured to use it. It will generally work a lot better, and be a lot easier to write. Commented Dec 29, 2008 at 20:07

1 Answer 1

3

Please read other postings here about parsing HTML. For example Python regular expression for HTML parsing (BeautifulSoup) and HTML parser in Python.

Use Beautiful Soup, not regular expressions.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.