Merging CSV rows using Python2 and keeping data from a single arbitrary column

Question

I know there are a lot of questions on this topic but the answers aren't particularly well explained so it's difficult to adapt to my use case. The one here seems very promising but the syntax is rather complex and I'm having difficulty understanding and adapting it.

I need to convert the raw CSV output from Nessus to a standard format, which essentially drops most of the columns and keeps only the severity, IP address, name, and output of each finding. I've put together a script that does just that, but if a finding is on multiple hosts/ports there is a separate row for each host/port.

What I need is to merge the rows that share a vulnerability name into a single row, combining only the IP address data from the merged rows.

Example input - shortened for ease

High,10.10.10.10,MS12-345(this is the name),Hackers can do bad things
High,10.10.10.11,MS12-345(this is the name),Hackers can do bad things

Example output

High,10.10.10.10 10.10.10.11,MS12-345(this is the name),Hackers can do bad things

Below is my script so far. I would appreciate if you make your answer easily adaptable (read: idiot-proof) for future readers and I'm sure they would too.

Bonus:

Sometimes the output field is different for findings with the same name, sometimes it is the same. If you've got some time on your hands, why not help a man check for this and append in the same way as the IP addresses if there's a difference in the output?

import sys
import csv

def manipulate(inFile):

    with open(inFile, 'rb') as csvFile:
        fileReader = csv.reader(csvFile, dialect='excel')

        # Check for multiple instances of findings and merge the rows
        # This happens when the finding is on multiple hosts/ports

        # YOUR CODE WILL GO HERE (Probably...)

        # Place findings into lists: crits, highs, meds, lows for sorting later
        crits = []
        highs = []
        meds = []
        lows = []

        for row in fileReader:

            if row[3] == "Critical":    
                crits.append(row)
            elif row[3] == "High":
                highs.append(row)
            elif row[3] == "Medium":
                meds.append(row)
            elif row[3] == "Low":
                lows.append(row)

        # Open an output file for writing
        with open('output.csv', 'wb') as outFile: 
            fileWriter = csv.writer(outFile)

            # Add in findings from lists in order of severity. Only relevant columns included
            for c in crits:
                fileWriter.writerow( (c[3], c[4], c[7], c[12]) )

            for h in highs:
                fileWriter.writerow( (h[3], h[4], h[7], h[12]) )

            for m in meds:
                fileWriter.writerow( (m[3], m[4], m[7], m[12]) )

            for l in lows:
                fileWriter.writerow( (l[3], l[4], l[7], l[12]) )


# Input validation
if len(sys.argv) != 2:
    print 'You must provide a csv file to process'
    raw_input('Example: python nesscsv.py foo.csv')
else:
    print "Working..."
    # Store filename for use in manipulate function
    inFile = str(sys.argv[1])
    # Call manipulate function passing csv
    manipulate(inFile)

    # Only report success when a file was actually processed
    print "Done!"
    raw_input("Output in output.csv. Hit return to finish.")

I'm slightly confused on what your expected output should be. Do you just want to remove duplicate values?
– MattR, Commented Mar 15, 2017 at 14:05
Yes and no. For each vulnerability I only want a single row. However, there is a separate row for every IP address that has that particular vulnerability, so if I just delete every row with the same vulnerability I would end up with only a single IP address when it could affect any number.
– I_GNU_it_all_along, Commented Mar 15, 2017 at 14:18
So what I need to do is, for every duplicate, append the new IP address to the same entry as the one for the old IP address before dropping the row. So I end up with just one row per vuln but listing every IP address affected by that vuln.
– I_GNU_it_all_along, Commented Mar 15, 2017 at 14:19

Norman · Accepted Answer · 2017-03-18 03:03:41Z

Here's a solution that uses an OrderedDict to gather rows in a way that preserves their order while also allowing you to look up any row by its vulnerability name.

import sys
import csv
from collections import OrderedDict

def manipulate(inFile):

    with open(inFile, 'rb') as csvFile:
        fileReader = csv.reader(csvFile, dialect='excel')

        # Check for multiple instances of findings and merge the rows
        # This happens when the finding is on multiple hosts/ports

        # Dictionary mapping vulns to merged rows.
        # It's ordered to preserve the order of rows.
        mergedRows = OrderedDict()

        for newRow in fileReader:
            vuln = newRow[7]
            if vuln not in mergedRows:
                # Convert the host and output fields into lists so we can easily
                # append values from rows that get merged with this one.
                newRow[4] = [newRow[4], ]
                newRow[12] = [newRow[12], ]
                # Add row for new vuln to dict.
                mergedRows[vuln] = newRow
            else:
                # Look up existing row for merging.
                mergedRow = mergedRows[vuln]
                # Append values of host and output fields, if they're new.
                if newRow[4] not in mergedRow[4]:
                    mergedRow[4].append(newRow[4])
                if newRow[12] not in mergedRow[12]:
                    mergedRow[12].append(newRow[12])

        # Flatten the lists of host and output field values into strings.
        for row in mergedRows.values():
            row[4] = ' '.join(row[4])
            row[12] = ' // '.join(row[12])

        # Place findings into lists: crits, highs, meds, lows for sorting later
        crits = []
        highs = []
        meds = []
        lows = []

        for row in mergedRows.values():

            if row[3] == "Critical":
                crits.append(row)
            elif row[3] == "High":
                highs.append(row)
            elif row[3] == "Medium":
                meds.append(row)
            elif row[3] == "Low":
                lows.append(row)

        # Open an output file for writing
        with open('output.csv', 'wb') as outFile:
            fileWriter = csv.writer(outFile)

            # Add in findings from lists in order of severity. Only relevant columns included
            for c in crits:
                fileWriter.writerow( (c[3], c[4], c[7], c[12]) )

            for h in highs:
                fileWriter.writerow( (h[3], h[4], h[7], h[12]) )

            for m in meds:
                fileWriter.writerow( (m[3], m[4], m[7], m[12]) )

            for l in lows:
                fileWriter.writerow( (l[3], l[4], l[7], l[12]) )


# Input validation
if len(sys.argv) != 2:
    print 'You must provide a csv file to process'
    raw_input('Example: python nesscsv.py foo.csv')
else:
    print "Working..."
    # Store filename for use in manipulate function
    inFile = str(sys.argv[1])
    # Call manipulate function passing csv
    manipulate(inFile)

    # Only report success when a file was actually processed
    print "Done!"
    raw_input("Output in output.csv. Hit return to finish.")
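
To make the merging step easier to adapt, here is a minimal sketch of the same OrderedDict idea pulled out into a standalone function and run on the shortened four-column example from the question. The names merge_rows, key_col and merge_cols are hypothetical (they are not part of the script above), and the column indices 1, 2 and 3 refer to the shortened example rather than the real Nessus layout. It avoids print so it should run unchanged under Python 2 or Python 3.

```python
from collections import OrderedDict


def merge_rows(rows, key_col, merge_cols):
    """Merge rows that share the same value in column key_col.

    merge_cols maps a column index to the separator used to join the
    distinct values collected for that column. Every other column is
    taken from the first row seen for each key.
    (Hypothetical helper, written for illustration only.)
    """
    merged = OrderedDict()  # key -> merged row, in first-seen order
    for row in rows:
        key = row[key_col]
        if key not in merged:
            # First row for this key: copy it and wrap the mergeable
            # fields in lists so later values can be appended.
            first = list(row)
            for col in merge_cols:
                first[col] = [row[col]]
            merged[key] = first
        else:
            existing = merged[key]
            for col in merge_cols:
                # Only keep values we have not already collected.
                if row[col] not in existing[col]:
                    existing[col].append(row[col])

    # Flatten the collected lists back into single strings.
    result = []
    for row in merged.values():
        for col, sep in merge_cols.items():
            row[col] = sep.join(row[col])
        result.append(row)
    return result


sample = [
    ['High', '10.10.10.10', 'MS12-345(this is the name)', 'Hackers can do bad things'],
    ['High', '10.10.10.11', 'MS12-345(this is the name)', 'Hackers can do bad things'],
]

# Merge on the name (column 2); join hosts with a space and
# differing outputs with ' // ', as in the script above.
merged = merge_rows(sample, key_col=2, merge_cols={1: ' ', 3: ' // '})
# merged == [['High', '10.10.10.10 10.10.10.11',
#             'MS12-345(this is the name)', 'Hackers can do bad things']]
```

Going by the indices used in the script above, the equivalent call for full Nessus rows would presumably be merge_rows(rows, 7, {4: ' ', 12: ' // '}). To merge on a different column or keep data from a different one, only those arguments need to change.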

Ya wee beauty! I suspected that this could be done using a dictionary somehow, just not experienced enough to figure it out myself. I'll give it a try now and come back to you. Sorry for the delay in response, had no connection over the weekend.
