Remove the Query String from a URL in HTML with a Regular Expression

Question

Given an HTML document, what is the most correct and concise regular expression pattern to remove the query string from each URL in the document?

bobince · Accepted Answer · 2008-11-07 10:57:01Z

5

You can't usefully parse HTML with a regexp. If you know the format of the page in advance, e.g.:

links are always in the form <a href="url with no unnecessary character escapes">, or
all links are absolute, and no other non-link strings beginning with http: exist

then you can just about get away with it, but for general [X]HTML a regexp parser is unsuitable.

Depending on what language you're using, you'd need to find either an HTML parser library (e.g. Python's BeautifulSoup), or an HTML tidier combined with a standard XML parser. Then scan the document for <a> elements (and maybe others, e.g. <img>, if you're interested in those), and split each URL attribute value on '?'.
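As a rough sketch of that scan-and-split step, assuming Python and using only the standard library's html.parser rather than BeautifulSoup (the names URL_ATTRS, LinkCollector and stripped_urls are made up for illustration):

```python
from html.parser import HTMLParser

# Hypothetical mapping of tag name to the attribute that holds its URL.
URL_ATTRS = {"a": "href", "img": "src"}


class LinkCollector(HTMLParser):
    """Collects URLs from <a href> and <img src> with the query string removed."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unescaped.
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value is not None:
                # Keep only the part before the first '?'.
                self.urls.append(value.split("?", 1)[0])


def stripped_urls(html_text):
    """Return the query-free URLs found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls
```

This only collects the cleaned URLs; writing them back into the document is easier with a tree-building parser such as BeautifulSoup. Note that splitting on '?' also discards any #fragment that follows the query.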


1 Comment

EoghanM Over a year ago

Thank you bobince, I was actually using BeautifulSoup but was looking for a quick and dirty way rather than iterating through all the links.

Andrew Bullock · Accepted Answer · 2008-11-07 11:02:29Z

2

Re: bobince's comment, the Html Agility Pack is a good HTML parser for .NET; it's more forgiving of incorrect markup than other parsers.

Using it will let you find all the <a> tags; then you can get each href and simply remove everything from the '?' onward.
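The "remove everything from the '?'" step itself is language-independent. A minimal Python sketch (the function names are hypothetical), with a variant that keeps a trailing #fragment, which plain truncation would drop:

```python
from urllib.parse import urlsplit


def truncate_at_query(url):
    """Drop the first '?' and everything after it, as the answer describes."""
    return url.partition("?")[0]


def drop_query_keep_fragment(url):
    """Variant (not from the answer): remove only the query, keep any #fragment."""
    return urlsplit(url)._replace(query="").geturl()
```

Which of the two is appropriate depends on whether fragments in the links matter for the page in question.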


nickf · Accepted Answer · 2008-11-07 11:07:59Z

0

Find this:

/href="([^\?"]*?)\?[^\"]*"/

Replace with:

href="\1"

You may have to watch out that it doesn't also rewrite the href of <link> tags.
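For reference, the same find/replace could be written in Python roughly as follows (strip_href_queries is a hypothetical helper name; the redundant backslashes inside the character classes are dropped, which does not change what the pattern matches):

```python
import re

# The pattern from above as a Python raw string.
HREF_QUERY = re.compile(r'href="([^?"]*?)\?[^"]*"')


def strip_href_queries(html_text):
    """Replace href="path?query" with href="path" wherever the pattern matches."""
    return HREF_QUERY.sub(r'href="\1"', html_text)


if __name__ == "__main__":
    print(strip_href_queries('<a href="page.html?id=3">x</a>'))
    # A <link> tag is rewritten too, which is the caveat mentioned above.
    print(strip_href_queries('<link href="style.css?v=2">'))
```

The pattern matches any double-quoted href attribute, so it cannot tell an <a> from a <link>.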


1 Comment

Greg Over a year ago

There are quite a few cases that won't match: href = "foo?bar", href = foo?bar (not valid, but could still appear), and href='foo?bar'.
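A quick way to check these cases, assuming the pattern from the answer translated to Python (the variable and function names are illustrative):

```python
import re

# The answer's pattern as a Python raw string (redundant escapes dropped).
HREF_QUERY = re.compile(r'href="([^?"]*?)\?[^"]*"')

# The three forms listed in the comment.
CASES = [
    '<a href = "foo?bar">',  # whitespace around '='
    '<a href = foo?bar>',    # unquoted value
    "<a href='foo?bar'>",    # single-quoted value
]


def unmatched(cases):
    """Return the inputs that the pattern fails to match."""
    return [case for case in cases if HREF_QUERY.search(case) is None]


if __name__ == "__main__":
    for case in unmatched(CASES):
        print("not matched:", case)
```

All three inputs come back unmatched, because the pattern requires the literal text href=" with no whitespace and a double quote.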
