0

I'm using the following regex (which I found online) to obtain the URLs within an HTML page:

        Regex regex = new Regex(@"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

It works fine for the HTML below:

<div style="background:url(images/logo.png) no-repeat;">UK</div>

However, it returns more than I need when the HTML page contains the following JavaScript; in that case it also returns 'destpage':

function buildurl(destpage) 

I tried the following regex to require a leading colon, but it appears to be invalid:

:url\((?<char>['""])?(?<:url>.*?)\k<char>?\)

Any help would be much appreciated.

Comments

  • stackoverflow.com/a/1732454/1043380 Stop using regex for parsing HTML. Use a more appropriate tool. Commented Aug 28, 2013 at 15:00
  • Try using a \b (word boundary) instead of a colon. Commented Aug 28, 2013 at 15:00
  • @Jerry Adding \b around url seemed to do the trick. Cheers. Commented Aug 28, 2013 at 15:17
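
For reference, the word-boundary suggestion from the comments could look roughly like the sketch below. It only puts \b in front of url in the original pattern; the class name CssUrlFinder and the method FindUrls are made up for illustration. Note that \b only stops matches such as buildurl(...); a standalone JavaScript call named url(...) would presumably still match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CssUrlFinder
{
    // \b requires "url" to start at a word boundary, so "buildurl(" is skipped.
    private static readonly Regex UrlPattern =
        new Regex(@"\burl\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    // Hypothetical helper: returns the text captured by the "url" group of every match.
    public static string[] FindUrls(string html)
    {
        MatchCollection matches = UrlPattern.Matches(html);
        string[] urls = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            urls[i] = matches[i].Groups["url"].Value;
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<div style=\"background:url(images/logo.png) no-repeat;\">UK</div>\n" +
            "function buildurl(destpage)";
        foreach (string url in FindUrls(html))
        {
            Console.WriteLine(url); // prints only: images/logo.png
        }
    }
}
```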

2 Answers

3

To get all the URLs, use HtmlAgilityPack instead of a regex. From their example page:

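HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        string href = link.GetAttributeValue("href", "");
    }
}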

You can expand on that to obtain the URLs in your style attributes by, for example, using //@style to select the style attributes and iterating through them to extract each url value.
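
A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced. It selects elements with //*[@style] (my choice here, rather than //@style) and runs the question's regex only on each style value, so script text such as buildurl(destpage) is never examined. StyleUrlExtractor and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package; assumed to be referenced by the project

public static class StyleUrlExtractor
{
    // The regex from the question, applied to style attribute values only.
    private static readonly Regex UrlPattern =
        new Regex(@"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var urls = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection styled = doc.DocumentNode.SelectNodes("//*[@style]");
        if (styled == null) return urls;

        foreach (HtmlNode node in styled)
        {
            string style = node.GetAttributeValue("style", "");
            foreach (Match m in UrlPattern.Matches(style))
            {
                urls.Add(m.Groups["url"].Value);
            }
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<div style=\"background:url(images/logo.png) no-repeat;\">UK</div>" +
            "<script>function buildurl(destpage) {}</script>";
        foreach (string url in Extract(html))
        {
            Console.WriteLine(url); // images/logo.png
        }
    }
}
```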

0

Only add the colon to the front:

:url\((?<char>['""])?(?<url>.*?)\k<char>?\)

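The second "url" is the name of that capture group, so it has to stay as it is; (?<:url> is invalid because a group name cannot contain a colon.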
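
To illustrate how the named group is read back, here is a rough sketch. Strict is the pattern above; Lenient is a hypothetical variant (not part of this answer) that adds \s* so that "background: url(...)" with a space after the colon also matches. ColonPrefixDemo and TryFirstUrl are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColonPrefixDemo
{
    // Pattern from the answer: a literal ':' must directly precede "url(".
    public static readonly Regex Strict =
        new Regex(@":url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    // Hypothetical variant: \s* tolerates whitespace between ':' and "url(".
    public static readonly Regex Lenient =
        new Regex(@":\s*url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    // Reads the group named "url" from the first match, if there is one.
    public static bool TryFirstUrl(Regex pattern, string input, out string url)
    {
        Match m = pattern.Match(input);
        url = m.Groups["url"].Value; // empty string when the match failed
        return m.Success;
    }

    public static void Main()
    {
        string url;
        Console.WriteLine(TryFirstUrl(Strict, "background:url(images/logo.png)", out url)); // True
        Console.WriteLine(url);                                                             // images/logo.png
        Console.WriteLine(TryFirstUrl(Strict, "function buildurl(destpage)", out url));     // False
        Console.WriteLine(TryFirstUrl(Strict, "background: url('a.png')", out url));        // False
        Console.WriteLine(TryFirstUrl(Lenient, "background: url('a.png')", out url));       // True
        Console.WriteLine(url);                                                             // a.png
    }
}
```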
