I am new to programming and Powershell, I've put together the following script; it parses through all the emails in a specified folder and extract the URLs from them. The script uses a regex pattern to identify the URLs and then extracts them to a text file. The extracted text is then run through another command where I am trying to remove the http:// or https:// portion (I need help with figuring this out), these are placed into another text file, from which I go through again to remove duplicates.
The main issue I am having is that the regex doesnt appear to extract the urls correctly. What I am getting is something like an example I have created below:
URL is http://www.dropbox.com/3jksffpwe/asdj.exe
But I end up getting
dropbox.com/3jksffpwe/asdj.exe
dropbox.com
drop
dropbox
The script is
#Adjust paths to location of saved Emails
$in_files = ‘C:\temp\*.eml, *.msg’
$out_file = ‘C:\temp\Output.txt’
$Working_file = ‘C:\temp\working.txt'
$Parsed_file = ‘C:\temp\cleaned.txt'
# Removes the old output file from earlier runs.
if (Test-Path $Parsed_file) {
Remove-Item $Parsed_file
}
# regex to parse thru each email and extract the URLs to a text file
$regex = ‘([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?’
select-string -Path $in_files -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $out_file
#Parses thru the output of urls to strip out the http or https portion
Get-Content $Out_file | ForEach-Object {$_.SubString(7)} | Out-File $Working_file
#Parses thru again to remove exact duplicates
$set = @{}
Get-Content $Working_file | %{
if (!$set.Contains($_)) {
$set.Add($_, $null)
$_
}
} | Set-Content $Parsed_file
#Removes the files no longer required
Del $out_file, $Working_file
#Confirms if the email messages should be removed
$Response = Read-Host "Do you want to remove the old messages? (Y|N)"
If ($Response -eq "Y") {del *.eml, *msg}
#Opens the output file in notepad
Notepad $Parsed_file
Exit
Thanks for any help
(..)in your regex. It is simply returning all the matches. As requested. The current answer seems to address that at least on PowerShell 3.0