
I am new to programming and PowerShell. I've put together the following script; it parses all the emails in a specified folder and extracts the URLs from them. The script uses a regex pattern to identify the URLs and writes them to a text file. That output is then run through another command that is supposed to remove the http:// or https:// portion (I need help figuring this out); the results go into another text file, which I process one more time to remove duplicates.

The main issue I am having is that the regex doesn't appear to extract the URLs correctly. What I am getting is something like the example I have created below:

The URL is http://www.dropbox.com/3jksffpwe/asdj.exe, but I end up getting:

dropbox.com/3jksffpwe/asdj.exe
dropbox.com 
drop  
dropbox

The script is:

#Adjust paths to location of saved Emails
$in_files = 'C:\temp\*.eml, *.msg'
$out_file = 'C:\temp\Output.txt'
$Working_file = 'C:\temp\working.txt'
$Parsed_file = 'C:\temp\cleaned.txt'

# Removes the old output file from earlier runs.
if (Test-Path $Parsed_file) {
  Remove-Item $Parsed_file
}

# regex to parse thru each email and extract the URLs to a text file
$regex = '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?'
select-string -Path $in_files -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $out_file

#Parses thru the output of urls to strip out the http or https portion  
Get-Content $out_file | ForEach-Object { $_.SubString(7) } | Out-File $Working_file


#Parses thru again to remove exact duplicates  
 $set = @{}  
 Get-Content $Working_file | %{  
   if (!$set.Contains($_)) {  
       $set.Add($_, $null)  
        $_  
    }  
} | Set-Content $Parsed_file  


#Removes the files no longer required  
Del $out_file, $Working_file  

#Confirms if the email messages should be removed  
$Response = Read-Host "Do you want to remove the old messages? (Y|N)"  

If ($Response -eq "Y") {del *.eml, *.msg}

#Opens the output file in notepad  
Notepad $Parsed_file  

Exit   

Thanks for any help

  • What output do you expect? Commented Feb 1, 2015 at 4:10
  • You have multiple match groups (..) in your regex. It is simply returning all the matches, as requested. The current answer seems to address that, at least on PowerShell 3.0. Commented Feb 1, 2015 at 14:26
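Illustrating the comment above: each capturing group `(...)` appears as a separate entry in a match's `Groups` collection, which can look like partial URLs (e.g. `www.`, `dropbox.`) if you enumerate the wrong property. A minimal sketch with a made-up sample string, using the question's pattern and a non-capturing `(?:...)` variant:

$text = 'URL is http://www.dropbox.com/3jksffpwe/asdj.exe'
$m = [regex]::Match($text, '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w./?%&=-]*)?')

$m.Value                          # the whole URL
$m.Groups | ForEach-Object Value  # whole URL plus each sub-capture

# Non-capturing groups (?:...) keep the full match but report no sub-captures
$m2 = [regex]::Match($text, '(?:[a-zA-Z]{3,})://(?:[\w-]+\.)+[\w-]+(?:/[\w./?%&=-]*)?')
$m2.Groups.Count                  # 1: only the full match remains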

2 Answers


Try this RegEx:

(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)

But remember that PowerShell's -match operator only captures the first match. To capture all matches you could do something like this:

$txt = "https://test.com, http://tes2.net, http:/test.com, http://test3.ro, text, http//:wrong.value"
$hash = @{}
$txt | Select-String -AllMatches '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)' |
    ForEach-Object { $hash."Valid URLs" = $_.Matches.Value }
$hash

Best of luck! Enjoy!
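Building on that, the scheme-stripping and de-duplication steps from the question can be sketched with -replace and Sort-Object -Unique instead of SubString(7), which assumes every URL starts with exactly seven characters of scheme (sample URLs here are made up):

$urls = 'https://test.com/a', 'http://test3.ro/b', 'https://test.com/a'

# -replace handles http://, https://, ftp:// etc., unlike a fixed SubString(7)
$urls -replace '^[a-zA-Z]+://' |
    Sort-Object -Unique

Note that Sort-Object -Unique sorts the output; the hashtable approach in the question preserves the original order instead.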

3

A regex for matching URLs can look like:

(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Check for more info here.
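As a usage sketch, the pattern above can be stored in a single-quoted here-string (so its embedded quote characters need no escaping) and fed to Select-String; the sample text is hypothetical:

# Pattern copied verbatim from above; the here-string avoids quote escaping
$pattern = @'
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
'@

$text = 'See http://www.dropbox.com/3jksffpwe/asdj.exe for details'
($text | Select-String -Pattern $pattern -AllMatches).Matches.Value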

1 Comment

Thank you. I was trying to extract the URLs in the following format: Http://dropbox.com/ffsdfsdfdsf/sdfdfdsf.asp
