I'm writing a PowerShell script that extracts URLs from ASPX files and tests whether their HTTP status code is 200.

I found the following Regex to get the URL:

$regex = "(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)"
select-string -Path $path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }

But the output looks like this:

https://code.jquery.com/ui/1.9.0/themes/base/jquery-ui.css"/>
https://code.jquery.com/ui/1.11.4/jquery-ui.min.js"></script>

As you can see, the match doesn't stop at the end of the URL; it also picks up the trailing HTML tag characters.

How can I edit my regex to get the URL without the HTML tags at the end?

  • Replace [^\s,] with [^\s,<>"] Commented Aug 18, 2017 at 7:15
  • @WiktorStribiżew Perfect, thanks! Commented Aug 18, 2017 at 7:17

1 Answer

If you look at the [^\s,] negated character class, you will see that it matches any character except whitespace and a comma. In your input, the ", < and > characters are therefore all matched by [^\s,], so the match runs past the end of the URL.

A fix for the current situation is to add the <, > and " characters to the negated character class, so that the regex engine stops as soon as it reaches any of them.

Note that since you extract whole matches, you can simplify the pattern a bit: remove the unnecessary groups and turn the first one into a non-capturing group:

$regex = '(?:http|s?ftp)s?://[^\s,<>"]+'
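As a quick check, the refactored pattern can be run against in-memory strings instead of files. This is only a minimal sketch: the two HTML lines are an assumed reconstruction of markup that would produce the output shown in the question, and the variable names $sample and $urls are made up for the illustration.

```powershell
# Assumed sample markup; the real ASPX content is not shown in the question.
$sample = @(
    '<link href="https://code.jquery.com/ui/1.9.0/themes/base/jquery-ui.css"/>'
    '<script src="https://code.jquery.com/ui/1.11.4/jquery-ui.min.js"></script>'
)

# Pattern from the answer: <, > and " now end the match.
$regex = '(?:http|s?ftp)s?://[^\s,<>"]+'

# Same pipeline as in the question, fed with strings instead of -Path.
$urls = $sample |
    Select-String -Pattern $regex -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

$urls
```

With this input, each match should end at the closing quote of the attribute, so the trailing "/> and "></script> are no longer part of the result.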

Mind that in .NET patterns, / does not need to be escaped (it is not a special regex metacharacter/operator).
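To illustrate the point about escaping, here is a small sketch comparing an escaped and an unescaped slash. The input string and the variable names ($text, $escaped, $unescaped) are hypothetical, and the pattern is shortened for readability.

```powershell
# Hypothetical input used only for this comparison.
$text = 'see https://example.com/a/b for details'

# \/ is accepted by the .NET regex engine, but the backslash is redundant.
$escaped   = [regex]::Match($text, 'https?:\/\/[^\s,<>"]+').Value
$unescaped = [regex]::Match($text, 'https?://[^\s,<>"]+').Value

# Both patterns should extract the same substring.
$escaped -eq $unescaped
```

Dropping the backslashes keeps the pattern easier to read without changing what it matches.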
