
I have a one-column CSV file. Depending on how many failure codes my machine has, this column will contain a varying number of codes (up to 10 space-separated sub-columns; see the example below). I want to manipulate this CSV so that the output is a clean list of the unique failure codes that have occurred.

Sample CSV file (sample.csv):

ActiveFaults

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344

00:1523 00:1345 00:1343 90:1344 90:5900 90:8988

00:1523 00:1345 00:1343 90:1344 90:5900 90:8988

BA:8797 BA:1330

Ideal output would be a CSV file of the following form:

IdealOutput.csv

UniqueActiveFaults

00:1523

00:1345

00:1343

90:1344

90:5900

90:8988

BA:8797

BA:1330

Any ideas how this can be done? I have tried several approaches (using Sort-Object, Group-Object, etc.), but none has worked as desired. Thank you.

3 Answers


Stop thinking about the file as CSV.

Just read it into a single string, split it on whitespace, filter for the tokens that contain a colon, and pipe them to Sort-Object -Unique:

$Values = (Get-Content .\sample.csv -Raw) -split '\s+' | Where-Object { $_ -like '*:*' }
"UniqueActiveFaults" | Out-File .\IdealOutput.csv
$Values | Sort-Object -Unique | Out-File .\IdealOutput.csv -Append

The -split operator takes a regular expression as its right-hand operand, in this case \s+. \s is shorthand for the "whitespace" character class, and + means "match one or more of the preceding element".
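
For example, splitting one of the sample lines directly at the prompt:

PS> '00:1523 00:1345 00:1343 90:1344' -split '\s+'
00:1523
00:1345
00:1343
90:1344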

If the file is huge, you can split processing into chunks with the ReadCount parameter in the first statement:

Get-Content .\sample.csv -ReadCount 100 | ForEach-Object { $_ -split '\s+' }
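
Combined with the rest of the pipeline, a chunked version might look like this (a sketch - the batch size of 100 is arbitrary, and as noted in the comments below, Sort-Object still has to hold every token in memory):

"UniqueActiveFaults" | Out-File .\IdealOutput.csv
Get-Content .\sample.csv -ReadCount 100 |
    ForEach-Object { $_ -split '\s+' } |
    Where-Object { $_ -like '*:*' } |
    Sort-Object -Unique |
    Out-File .\IdealOutput.csv -Append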

If : is present elsewhere in the document and the desired values are always of the form

[2 character prefix]:[numerical]

you could narrow it by changing the Where-Object filter to:

{$_ -match '.{2}:\d+'}
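
A quick test with a couple of made-up stray tokens shows the difference:

PS> 'note:', 'BA:8797', '90:1344' | Where-Object { $_ -match '.{2}:\d+' }
BA:8797
90:1344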

7 Comments

I think the file has a header row ("ActiveFaults"). Also the output file has a header row. In that case I think it would make sense to use Import-Csv and Export-Csv instead of Get-Content and Out-File.
@dan-gph Yes, but if the file is huge, you would incur a massive overhead from creating objects with a single property UniqueActiveFaults just to be able to write it back to disk using Export-Csv. If the file had multiple columns it might make sense, but in this case I don't think the tradeoff is worth it
Massive overhead? That sounds like a premature optimization. For all we know the files are only ever 10 lines long. As it stands, your code doesn't meet the requirements because it doesn't deal with headers. By the way, I don't think the -ReadCount will help you. The Sort-Object will have to load the whole file into memory anyway.
@dan-gph Judging from the sample, I don't need to handle the header; my code still solves the OP's problem. Regarding Raw vs. ReadCount: if the input file is > 1GB ASCII, it would be too big to fit into a single string, and thus ReadCount would in fact help. Sort-Object doesn't need to load the whole file, but the entire set of individual strings - there's a big difference.
I can see a header line in both the input and output sample files. With the ReadCount, I mean it's not going to make any difference in terms of memory usage if you use -ReadCount 1 (the default) or -ReadCount 100 on the Get-Content. I guess it might speed it up a tiny bit. That sounds like another premature optimization.

Since Mathias didn't like my suggestion, I'll show what I meant here:

Import-Csv .\Sample.csv | 
    ForEach-Object { $_.ActiveFaults -split '\s+' } | 
    Sort-Object -Unique | 
    Select-Object @{name='UniqueActiveFaults'; expr={ $_ } } | 
    Export-Csv IdealOutput.csv -NoTypeInformation

The output looks like this:

"UniqueActiveFaults"
"00:1343"
"00:1345"
"00:1523"
"90:1344"
"90:5900"
"90:8988"
"BA:1330"
"BA:8797"

If the input were really huge and the above code couldn't deal with it efficiently, I'd try piping the values into a .NET HashSet in place of the Sort-Object.
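
For illustration, a minimal sketch of that idea (assuming PowerShell 5 or later for the ::new() syntax; note that a HashSet doesn't keep the values sorted, so the output order will differ from the Sort-Object version):

# Collect unique fault codes without sorting
$set = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Sample.csv |
    ForEach-Object { $_.ActiveFaults -split '\s+' } |
    ForEach-Object { [void]$set.Add($_) }   # Add() returns $false for duplicates; [void] discards it
$set |
    Select-Object @{ name = 'UniqueActiveFaults'; expr = { $_ } } |
    Export-Csv IdealOutput.csv -NoTypeInformation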

2 Comments

Doh! Had not thought about using calculated expressions with Select-Object. In that light, your suggestion does indeed make a lot of sense, I most certainly like it ;-)
Thanks @Mathias, credit to you for the basic idea. I hope my comments weren't too annoying :)
0
@ECHO Off
SETLOCAL
:: remove variables starting $
FOR /F "delims==" %%a IN ('set $ 2^>Nul') DO SET "%%a="
(
 ECHO(UniqueActiveFaults
 FOR /f "delims=" %%a IN (q29884835.txt) DO FOR %%b IN (%%a) DO SET "$%%b=y"
 FOR /f "delims=$=" %%a IN ('set $^|find ":"') DO ECHO(%%a

)>u:\newfile.csv

GOTO :EOF

I used a file named q29884835.txt containing your data for my testing.

Produces u:\newfile.csv

Well - it's obviously not PowerShell, but it works.

The first FOR clears out any environment variables starting with $. There normally are none, so it's probably not required.

The second FOR line reads the file and then, for each element, sets a variable named $<elementcontents> to y (the fact that it's set to something is important; the something is not).

The third FOR line selects the set $ output lines that contain : and echoes the variable-name part of each.
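
For example, after the second FOR has processed the sample data, set $ prints lines of this form:

$00:1343=y
$00:1345=y
$00:1523=y
...

and the delims=$= in the third FOR then picks out just the part between the $ and the = as %%a.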

1 Comment

That is impressive. I take my hat off to you. But frankly that code looks pretty horrific. Why not learn PowerShell? ;)
