
I guess the question is in the title.

I have a CSV that looks something like

user,path,original_path

I'm trying to find duplicates on the original path, then output both the user and original_path line.

This is what I have so far.

$2 = Import-Csv 'Total 20_01_16.csv' |
    Group-Object -Property Original_path |
    Where-Object { $_.Count -ge 2 } |
    Format-List Group |
    Out-String -Width 500

This gives me the duplicates in Original_Path. I can see all the required information but I'll be danged if I know how to get to it or format it into something useful.

I did a bit of Googling and found this script:

$ROWS = Import-Csv -Path 'Total 20_01_16.csv'
$NAMES = @{}
$OUTPUT = foreach ( $ROW in $ROWS ) {
    if ( $NAMES.ContainsKey( $ROW.Original_path ) -and $NAMES[$ROW.Original_path] -lt 2 ) {
        $ROW
    }
    $NAMES[$ROW.Original_path] += 1
}

Write-Output $OUTPUT

I'm reluctant to use this because, well, first, I have no idea what it's doing. So little of it makes any sense to me, and I don't like using scripts I can't get my head around. Also, and this is the more important part, it's only giving me a single duplicate; it's not giving me both sets. I'm after both offending lines, so I can find both users with the same file.

If anyone could be so kind as to lend a hand I'd appreciate it. Thanks

  • What output do you need? The original csv-rows for duplicates? Commented Jan 21, 2016 at 11:42
  • Pretty much the whole thing. So if a duplicate is found in Original_Path, I want User,Path,Original_Path, but I need the output for both discoveries. So if my csv looks like this:

    user,path,original_path
    user1,\\compa\c$\program files\test.doc,\\server1\files\test1.doc
    user2,\\compb\c$\program files\test.doc,\\server1\files\test1.doc

    I'll need to know about both user1 and user2, not just user2, which is all I'm getting at the moment. Thanks Commented Jan 21, 2016 at 11:51

2 Answers


It depends on the output format you need, but to build on what you already have we can use this to show the records in the console:

Import-Csv 'Total 20_01_16.csv' |
    Group-Object -Property Original_path |
    Where-Object { $_.Count -ge 2 } |
    ForEach-Object { $_.Group } |
    Format-Table User, Path, Original_path -AutoSize

Alternatively, use this to save them in a new csv-file:

Import-Csv 'Total 20_01_16.csv' |
    Group-Object -Property Original_path |
    Where-Object { $_.Count -ge 2 } |
    ForEach-Object { $_.Group } |
    Select-Object User, Path, Original_path |
    Export-Csv -Path output.csv -NoTypeInformation
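
To see why this returns both rows: Group-Object collects every record sharing the same Original_path into one group, and $_.Group expands each surviving group back into its individual records. Here's the sample data from the comments above, built in-memory with ConvertFrom-Csv so you can try it without a file (the user3 row is a hypothetical extra unique row, added for contrast):

    # Hypothetical sample data, based on the example in the comments
    $csv = @'
    user,path,original_path
    user1,\\compa\c$\program files\test.doc,\\server1\files\test1.doc
    user2,\\compb\c$\program files\test.doc,\\server1\files\test1.doc
    user3,\\compc\c$\other\unique.doc,\\server2\files\unique.doc
    '@ | ConvertFrom-Csv

    $csv |
        Group-Object -Property original_path |
        Where-Object { $_.Count -ge 2 } |
        ForEach-Object { $_.Group } |
        Format-Table user, path, original_path -AutoSize

This prints the user1 and user2 rows (both members of the duplicate group) and drops user3.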

3 Comments

I'm surprised there's not an option like unix's "uniq -d" that prints out the duplicates. I also tried "sort -u propertyname" and then doing a diff with the original array, but it didn't work well.
Jeeze this almost killed my pc for a 430.000.000 line file (not exaggerating, and running 64GB ram). Isn't there something cheaper? The file is already sorted.
Thousands vs millions lines require different approaches for text reading in general. Look into System.IO.StreamReader + a dictionary/hashset/whatever to quickly lookup duplicates. Ex. of using StreamReader: stackoverflow.com/questions/35119112/…
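A sketch along the lines of that comment's suggestion. Since the file is already sorted on the duplicate column, you don't even need a hashset; comparing each line's key to the previous line's key is enough. This assumes original_path is the third column and that values contain no quoted commas (a plain string split would break on those), so adjust the index and parsing to match your csv:

    # Streaming approach for very large, sorted files: read line by line and
    # compare each row's key to the previous one instead of loading everything.
    $reader = [System.IO.StreamReader]::new('Total 20_01_16.csv')
    $writer = [System.IO.StreamWriter]::new('output.csv')
    try {
        $header = $reader.ReadLine()
        $writer.WriteLine($header)

        $prevLine    = $null
        $prevKey     = $null
        $prevWritten = $false

        while ($null -ne ($line = $reader.ReadLine())) {
            $key = ($line -split ',')[2]   # original_path assumed to be the 3rd column
            if ($key -eq $prevKey) {
                # Duplicate run: emit the first occurrence once, then every repeat.
                if (-not $prevWritten) { $writer.WriteLine($prevLine) }
                $writer.WriteLine($line)
                $prevWritten = $true
            } else {
                $prevWritten = $false
            }
            $prevLine = $line
            $prevKey  = $key
        }
    }
    finally {
        $reader.Dispose()
        $writer.Dispose()
    }

Memory use stays constant regardless of file size, because only the current and previous lines are held at any time.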

Note: if the field you group on is blank for every row in the file, the logic above skips it and doesn't output the duplicate records/fields.

