0

Given an HTML document, what is the most correct and concise regular expression pattern to remove the query string from each URL in the document?

3 Answers

5

You can't usefully parse HTML with a regexp. If you know the format of the page in advance, e.g.

  • links are always in the form <a href="url with no unnecessary character escapes">, or
  • all links are absolute, and no other non-link strings beginning with http: exist

then you can just about get away with it, but for general [X]HTML a regexp parser is unsuitable.

Depending on what language you're using, you'd need to find either an HTML parser library (e.g. Python's BeautifulSoup) or an HTML tidier combined with a standard XML parser. Then scan the document for <a> elements (and maybe others, e.g. <img>, if you're interested in those), and split each URL attribute value on ‘?’, keeping only the part before it.
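As a rough sketch of the parser-based approach, here is one way it could look in Python. To keep it dependency-free it uses the standard library's html.parser instead of BeautifulSoup, which is a substitution on my part. The names strip_query, LinkCollector and cleaned_links are made up for illustration, and the sketch only collects the cleaned URLs rather than rewriting the document.

```python
from html.parser import HTMLParser


def strip_query(url):
    """Return url with everything from the first '?' onward removed."""
    return url.split("?", 1)[0]


class LinkCollector(HTMLParser):
    """Collect (tag, attribute, cleaned URL) for <a href> and <img src>."""

    # Hypothetical choice of which tag/attribute pairs to inspect.
    URL_ATTRS = {"a": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from the parser.
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value is not None:
                self.links.append((tag, name, strip_query(value)))


def cleaned_links(html):
    """Parse html and return the collected links with query strings removed."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the parser handles the attribute syntax, quoting style and whitespace around the = sign don't matter here, and <link> tags are left alone simply because they aren't listed in URL_ATTRS.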


1 Comment

Thank you bobince, I was actually using BeautifulSoup but was looking for a quick and dirty way rather than iterating through all the links.
2

Re: bobince's answer, the HTMLAgilityPack is a good HTML parser for .NET; it's more forgiving of incorrect markup than other parsers.

Using it, you can find all the A tags, get each HREF, and simply remove everything from the '?' onward.


0

Find this:

/href="([^\?"]*?)\?[^\"]*"/

Replace with:

href="\1"

Watch out: this also matches the href of <link> tags, so it will strip query strings from those too.
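For anyone who wants to try this, here is a minimal sketch of the same find/replace using Python's re module. The pattern is copied unchanged from above; the function name strip_href_queries is just an illustrative choice.

```python
import re

# Same pattern as above: capture everything up to the first '?' inside a
# double-quoted href, then consume the rest of the value up to the closing quote.
PATTERN = re.compile(r'href="([^\?"]*?)\?[^\"]*"')


def strip_href_queries(html):
    """Replace each matching href="...?..." with href="..." (query removed)."""
    return PATTERN.sub(r'href="\1"', html)
```

Note that it only touches double-quoted href attributes with no whitespace around the = sign; anything else passes through unchanged.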

1 Comment

There are quite a few cases that won't match: href = "foo?bar", href = foo?bar (not valid, but it could still appear), and href='foo?bar'.
