
I am new to programming and PowerShell. I've put together the following script; it parses all the emails in a specified folder and extracts the URLs from them. The script uses a regex pattern to identify the URLs and writes them to a text file. That output is then run through another command that is supposed to remove the http:// or https:// portion (I need help figuring this out); the results go into another text file, which I process one more time to remove duplicates.

The main issue I am having is that the regex doesn't appear to extract the URLs correctly. What I am getting is something like the example I have created below:

The URL is http://www.dropbox.com/3jksffpwe/asdj.exe, but I end up getting:

dropbox.com/3jksffpwe/asdj.exe
dropbox.com 
drop  
dropbox

The script is:

#Adjust paths to location of saved Emails
$in_files = 'C:\temp\*.eml, *.msg'
$out_file = 'C:\temp\Output.txt'
$Working_file = 'C:\temp\working.txt'
$Parsed_file = 'C:\temp\cleaned.txt'

# Removes the old output file from earlier runs.
if (Test-Path $Parsed_file) {
  Remove-Item $Parsed_file
}

# regex to parse thru each email and extract the URLs to a text file
$regex = '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?'
select-string -Path $in_files -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $out_file

#Parses thru the output of urls to strip out the http or https portion  
Get-Content $out_file | ForEach-Object { $_.SubString(7) } | Out-File $Working_file


#Parses thru again to remove exact duplicates  
 $set = @{}  
 Get-Content $Working_file | %{  
   if (!$set.Contains($_)) {  
       $set.Add($_, $null)  
        $_  
    }  
} | Set-Content $Parsed_file  


#Removes the files no longer required  
Del $out_file, $Working_file  

#Confirms if the email messages should be removed  
$Response = Read-Host "Do you want to remove the old messages? (Y|N)"  

If ($Response -eq "Y") {del *.eml, *.msg}

#Opens the output file in notepad  
Notepad $Parsed_file  

Exit   

Thanks for any help

  • What output do you expect? Commented Feb 1, 2015 at 4:10
  • You have multiple match groups (..) in your regex. It is simply returning all the matches, as requested. The current answer seems to address that, at least on PowerShell 3.0. Commented Feb 1, 2015 at 14:26
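Illustrating the comment above: each capturing group `(...)` appears as a separate entry in a match's `Groups` collection, which can look like partial URLs (e.g. `www.`, `dropbox.`) if you enumerate the wrong property. A minimal sketch with a made-up sample string, using the question's pattern and a non-capturing `(?:...)` variant:

$text = 'URL is http://www.dropbox.com/3jksffpwe/asdj.exe'
$m = [regex]::Match($text, '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w./?%&=-]*)?')

$m.Value                          # the whole URL
$m.Groups | ForEach-Object Value  # whole URL plus each sub-capture

# Non-capturing groups (?:...) keep the full match but report no sub-captures
$m2 = [regex]::Match($text, '(?:[a-zA-Z]{3,})://(?:[\w-]+\.)+[\w-]+(?:/[\w./?%&=-]*)?')
$m2.Groups.Count                  # 1: only the full match remains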

2 Answers


Try this RegEx:

(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)

But remember that PowerShell's -match operator only captures the first match. To capture all matches you could do something like this:

$txt = "https://test.com, http://tes2.net, http:/test.com, http://test3.ro, text, http//:wrong.value"
$hash = @{}
$txt | Select-String -AllMatches '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)' |
    ForEach-Object { $hash."Valid URLs" = $_.Matches.Value }
$hash

Best of luck! Enjoy!
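Building on that, the scheme-stripping and de-duplication steps from the question can be sketched with -replace and Sort-Object -Unique instead of SubString(7), which assumes every URL starts with exactly seven characters of scheme (sample URLs here are made up):

$urls = 'https://test.com/a', 'http://test3.ro/b', 'https://test.com/a'

# -replace handles http://, https://, ftp:// etc., unlike a fixed SubString(7)
$urls -replace '^[a-zA-Z]+://' |
    Sort-Object -Unique

Note that Sort-Object -Unique sorts the output; the hashtable approach in the question preserves the original order instead.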

3

A regex for matching URLs can look like:

(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Check for more info here.
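As a usage sketch, the pattern above can be stored in a single-quoted here-string (so its embedded quote characters need no escaping) and fed to Select-String; the sample text is hypothetical:

# Pattern copied verbatim from above; the here-string avoids quote escaping
$pattern = @'
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
'@

$text = 'See http://www.dropbox.com/3jksffpwe/asdj.exe for details'
($text | Select-String -Pattern $pattern -AllMatches).Matches.Value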

1 Comment

Thank you. I was trying to extract the URLs in the following format: Http://dropbox.com/ffsdfsdfdsf/sdfdfdsf.asp
