
I have a one-column CSV file. Depending on how many failure codes my machine has, this column will contain a varying number of codes (up to 10 space-separated sub-columns; see the example below). I want to manipulate this CSV so that the output is a clean list of the unique failure codes that have occurred.

Sample CSV file (sample.csv):

ActiveFaults

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344 90:5900 90:8988

00:1523 00:1345 00:1343 90:1344 90:5900 90:8988

BA:8797 BA:1330

Ideal output would be a CSV file of the following form:

IdealOutput.csv

UniqueActiveFaults

00:1523

00:1345

00:1343

90:1344

90:5900

90:8988

BA:8797

BA:1330

Any ideas how this can be done? I have tried several approaches (using Sort-Object, Group-Object, etc.), but none has worked as desired. Thank you.

3 Answers


Stop thinking about the file as CSV.

Just read it into a single string, split it on whitespace, filter for the tokens that contain a colon, and pipe them to Sort-Object -Unique:

$Values = (Get-Content .\sample.csv -Raw) -split '\s+' | Where-Object { $_ -like '*:*' }
"UniqueActiveFaults" | Out-File .\IdealOutput.csv
$Values | Sort-Object -Unique | Out-File .\IdealOutput.csv -Append

The -split operator takes a regular expression as its right-hand operand, in this case \s+. \s is shorthand for the "whitespace" character class, and + means "match one or more of the preceding element".
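
For example, splitting one of the sample lines directly at the prompt:

PS> '00:1523 00:1345 00:1343 90:1344' -split '\s+'
00:1523
00:1345
00:1343
90:1344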

If the file is huge, you can split processing into chunks with the ReadCount parameter in the first statement:

Get-Content .\sample.csv -ReadCount 100 | ForEach-Object { $_ -split '\s+' }
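
Combined with the rest of the pipeline, a chunked version might look like this (a sketch - the batch size of 100 is arbitrary, and as noted in the comments below, Sort-Object still has to hold every token in memory):

"UniqueActiveFaults" | Out-File .\IdealOutput.csv
Get-Content .\sample.csv -ReadCount 100 |
    ForEach-Object { $_ -split '\s+' } |
    Where-Object { $_ -like '*:*' } |
    Sort-Object -Unique |
    Out-File .\IdealOutput.csv -Append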

If : is present elsewhere in the document and the desired values are always of the form

[2 character prefix]:[numerical]

you could narrow it by changing the Where-Object filter to:

{$_ -match '.{2}:\d+'}
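
A quick test with a couple of made-up stray tokens shows the difference:

PS> 'note:', 'BA:8797', '90:1344' | Where-Object { $_ -match '.{2}:\d+' }
BA:8797
90:1344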

7 Comments

I think the file has a header row ("ActiveFaults"). Also the output file has a header row. In that case I think it would make sense to use Import-Csv and Export-Csv instead of Get-Content and Out-File.
@dan-gph Yes, but if the file is huge, you would incur a massive overhead from creating objects with a single property UniqueActiveFaults just to be able to write it back to disk using Export-Csv. If the file had multiple columns it might make sense, but in this case I don't think the tradeoff is worth it
Massive overhead? That sounds like a premature optimization. For all we know the files are only ever 10 lines long. As it stands, your code doesn't meet the requirements because it doesn't deal with headers. By the way, I don't think the -ReadCount will help you. The Sort-Object will have to load the whole file into memory anyway.
@dan-gph Judging from the sample, I don't need to handle the header; my code still solves the OP's problem. Regarding Raw vs. ReadCount: if the input file is > 1GB ASCII, it would be too big to fit into a single string, and thus ReadCount would in fact help. Sort-Object doesn't need to load the whole file, but the entire set of individual strings - there's a big difference.
I can see a header line in both the input and output sample files. With the ReadCount, I mean it's not going to make any difference in terms of memory usage if you use -ReadCount 1 (the default) or -ReadCount 100 on the Get-Content. I guess it might speed it up a tiny bit. That sounds like another premature optimization.

Since Mathias didn't like my suggestion, I'll show what I meant here:

Import-Csv .\Sample.csv | 
    ForEach-Object { $_.ActiveFaults -split '\s+' } | 
    Sort-Object -Unique | 
    Select-Object @{name='UniqueActiveFaults'; expr={ $_ } } | 
    Export-Csv IdealOutput.csv -NoTypeInformation

The output looks like this:

"UniqueActiveFaults"
"00:1343"
"00:1345"
"00:1523"
"90:1344"
"90:5900"
"90:8988"
"BA:1330"
"BA:8797"

If the input were really huge and the above code couldn't deal with it efficiently, I'd try piping the values into a .NET HashSet in place of the Sort-Object.
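
For illustration, a minimal sketch of that idea (assuming PowerShell 5 or later for the ::new() syntax; note that a HashSet doesn't keep the values sorted, so the output order will differ from the Sort-Object version):

# Collect unique fault codes without sorting
$set = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Sample.csv |
    ForEach-Object { $_.ActiveFaults -split '\s+' } |
    ForEach-Object { [void]$set.Add($_) }   # Add() returns $false for duplicates; [void] discards it
$set |
    Select-Object @{ name = 'UniqueActiveFaults'; expr = { $_ } } |
    Export-Csv IdealOutput.csv -NoTypeInformation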

2 Comments

Doh! Had not thought about using calculated expressions with Select-Object. In that light, your suggestion does indeed make a lot of sense, I most certainly like it ;-)
Thanks @Mathias, credit to you for the basic idea. I hope my comments weren't too annoying :)
0
@ECHO Off
SETLOCAL
:: remove variables starting $
FOR /F "delims==" %%a IN ('set $ 2^>Nul') DO SET "%%a="
(
 ECHO(UniqueActiveFaults
 FOR /f "delims=" %%a IN (q29884835.txt) DO FOR %%b IN (%%a) DO SET "$%%b=y"
 FOR /f "delims=$=" %%a IN ('set $^|find ":"') DO ECHO(%%a

)>u:\newfile.csv

GOTO :EOF

I used a file named q29884835.txt containing your data for my testing.

Produces u:\newfile.csv

Well - it's obviously not PowerShell, but it works.

The first FOR clears out any environment variables starting with $. There normally are none, so it's probably not required.

The second FOR line reads the file and then, for each element, sets a variable named $<elementcontents> to y (the fact that it's set to something is important; the something is not).

The third FOR line selects the set $ output lines that contain : and echoes the variable-name part of each.
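
For example, after the second FOR has processed the sample data, set $ prints lines of this form:

$00:1343=y
$00:1345=y
$00:1523=y
...

and the delims=$= in the third FOR then picks out just the part between the $ and the = as %%a.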

1 Comment

That is impressive. I take my hat off to you. But frankly that code looks pretty horrific. Why not learn PowerShell? ;)
