2

I'm new to Python but excited about it, and I need your advice. I came up with the following code to compare two CSV files based on an nmap scan:

import pandas as pd
from pandas import DataFrame
import os
file = raw_input('\nEnter the Old CSV file: ')
file1 = raw_input('\nEnter the New CSV file: ')
A=set(pd.read_csv(file, index_col=False, header=None)[0])
B=set(pd.read_csv(file1, index_col=False, header=None)[0])
final=list(A-B)
df = pd.DataFrame(final, columns=["host"])
df.to_csv('DIFF_'+file)

print "Completed!"

When I run it, I get the following results:

,host
0,82.214.228.71;dsl-radius-02.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;
1,82.214.228.70;dsl-radius-01.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;

My question is how to add a label/name to columns 2, 3, etc., for example: hostname, port, port name, state, etc. I have tried df['hostname'] = range(1, len(df) + 1), but this adds hostname to the first column along with host when I open the file in Excel.

1
  • Do you want to compare all columns or only the first? Commented Aug 14, 2017 at 11:01

2 Answers

3

I think you need read_csv with the parameters sep=';' and names to define the column names first:

file = raw_input('\nEnter the Old CSV file: ')
file1 = raw_input('\nEnter the New CSV file: ')

cols = ['hostname','port','portname', ...]
A= pd.read_csv(file, index_col=False, header=None, sep=';', names=cols)
B= pd.read_csv(file1, index_col=False, header=None, sep=';', names=cols)
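
To illustrate why sep=';' matters here, the following minimal sketch reads one line from the question with and without it. The first five labels follow the names mentioned in the comments below; the labels c6 to c13 are hypothetical placeholders, since the real column names are not given.

```python
from io import StringIO
import pandas as pd

line = u"82.214.228.71;dsl-radius-02.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;\n"

# With the default sep=',' no separator is found, so the whole line
# ends up as a single string in a single column.
raw = pd.read_csv(StringIO(line), header=None)

# The line holds 12 semicolons, i.e. 13 fields (the last one is empty).
# Labels c6..c13 are hypothetical placeholders, not nmap's real names.
cols = ['host', 'hostname', 'hostname_type', 'protocol', 'port',
        'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13']
named = pd.read_csv(StringIO(line), header=None, sep=';', names=cols)

print(raw.shape)     # one row, one column
print(named.shape)   # one row, thirteen labelled columns
print(named.loc[0, 'host'])
```

The number of labels in names should match the number of fields per line, including the empty trailing field produced by the final semicolon.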

Then use merge with indicator=True and filter the result with boolean indexing if you need to compare all columns:

df = pd.merge(A, B, how='outer', indicator=True)
df = df[df['_merge']=='left_only'].drop('_merge',axis=1)

df.to_csv('DIFF_'+file)

print "Completed!"
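
The snippet above is Python 2 (raw_input, print statement). As a rough Python 3 sketch of the same steps, the code below wraps them in a function and writes the result with sep=';' and without the index column. The function name rows_only_in_old, the three column labels, and the sample rows are all hypothetical and only there to make the example self-contained.

```python
import os
import tempfile
import pandas as pd

def rows_only_in_old(old_path, new_path, cols):
    """Return rows of old_path that have no identical row in new_path."""
    old = pd.read_csv(old_path, header=None, sep=';', names=cols)
    new = pd.read_csv(new_path, header=None, sep=';', names=cols)
    # The outer merge on all shared columns tags each row with its origin.
    merged = pd.merge(old, new, how='outer', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)

cols = ['host', 'hostname', 'port']  # hypothetical labels

with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, 'old.csv')
    new_path = os.path.join(tmp, 'new.csv')
    with open(old_path, 'w') as f:
        f.write('10.0.0.1;alpha;22\n10.0.0.2;beta;80\n')
    with open(new_path, 'w') as f:
        f.write('10.0.0.2;beta;80\n10.0.0.3;gamma;443\n')

    diff = rows_only_in_old(old_path, new_path, cols)

    # sep=';' keeps the input format; index=False drops the row numbers
    # that otherwise show up as an unnamed first column in Excel.
    out_path = os.path.join(tmp, 'DIFF_old.csv')
    diff.to_csv(out_path, sep=';', index=False)
    with open(out_path) as f:
        written = f.read()

print(written)
```

In a real script the two paths would come from input() instead of the temporary files used here.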

Sample:

import pandas as pd
from io import StringIO

temp=u"""82.214.228.71;dsl-radius-02.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;
82.214.228.70;dsl-radius-01.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;
82.214.228.74;dsl-radius-02.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;
82.214.228.75;dsl-radius-01.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
cols = ['hostname','port','portname', 'a','b','c','d','e','f','g','h','i', 'j']
A = pd.read_csv(StringIO(temp), sep=";", names=cols)
print (A)
        hostname                         port portname    a    b        c  \
0  82.214.228.71  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
1  82.214.228.70  dsl-radius-01.direcpceu.com      PTR  tcp  111  rpcbind   
2  82.214.228.74  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
3  82.214.228.75  dsl-radius-01.direcpceu.com      PTR  tcp  111  rpcbind   

      d   e   f        g   h  i   j  
0  open NaN NaN  syn-ack NaN  3 NaN  
1  open NaN NaN  syn-ack NaN  3 NaN  
2  open NaN NaN  syn-ack NaN  3 NaN  
3  open NaN NaN  syn-ack NaN  3 NaN  

temp=u"""82.214.228.75;dsl-radius-02.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;
82.214.228.70;dsl-radius-01.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;
82.214.228.77;dsl-radius-02.direcpceu.com;PTR;tcp;111;rpcbind;open;;;syn-ack;;3;
"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
cols = ['hostname','port','portname', 'a','b','c','d','e','f','g','h','i', 'j']
B = pd.read_csv(StringIO(temp), sep=";", names=cols)
print (B)
        hostname                         port portname    a    b        c  \
0  82.214.228.75  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
1  82.214.228.70  dsl-radius-01.direcpceu.com      PTR  tcp  111  rpcbind   
2  82.214.228.77  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   

      d   e   f        g   h  i   j  
0  open NaN NaN  syn-ack NaN  3 NaN  
1  open NaN NaN  syn-ack NaN  3 NaN  
2  open NaN NaN  syn-ack NaN  3 NaN 

df1 = pd.merge(A, B, how='outer', indicator=True)

print (df1)

        hostname                         port portname    a    b        c  \
0  82.214.228.71  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
1  82.214.228.70  dsl-radius-01.direcpceu.com      PTR  tcp  111  rpcbind   
2  82.214.228.74  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
3  82.214.228.75  dsl-radius-01.direcpceu.com      PTR  tcp  111  rpcbind   
4  82.214.228.75  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
5  82.214.228.77  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   

      d   e   f        g   h  i   j      _merge  
0  open NaN NaN  syn-ack NaN  3 NaN   left_only  
1  open NaN NaN  syn-ack NaN  3 NaN        both  
2  open NaN NaN  syn-ack NaN  3 NaN   left_only  
3  open NaN NaN  syn-ack NaN  3 NaN   left_only  
4  open NaN NaN  syn-ack NaN  3 NaN  right_only  
5  open NaN NaN  syn-ack NaN  3 NaN  right_only  
#only values in A
df1 = df1[df1['_merge']=='left_only'].drop('_merge',axis=1)
print (df1)
        hostname                         port portname    a    b        c  \
0  82.214.228.71  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
2  82.214.228.74  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
3  82.214.228.75  dsl-radius-01.direcpceu.com      PTR  tcp  111  rpcbind   

      d   e   f        g   h  i   j  
0  open NaN NaN  syn-ack NaN  3 NaN  
2  open NaN NaN  syn-ack NaN  3 NaN  
3  open NaN NaN  syn-ack NaN  3 NaN
#only values in B
df1 = pd.merge(A, B, how='outer', indicator=True)
df11 = df1[df1['_merge']=='right_only'].drop('_merge',axis=1)
print (df11)
        hostname                         port portname    a    b        c  \
4  82.214.228.75  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
5  82.214.228.77  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   

      d   e   f        g   h  i   j  
4  open NaN NaN  syn-ack NaN  3 NaN  
5  open NaN NaN  syn-ack NaN  3 NaN 
#same values in both dataframes
df12 = df1[df1['_merge']=='both'].drop('_merge',axis=1)
print (df12)
        hostname                         port portname    a    b        c  \
1  82.214.228.70  dsl-radius-01.direcpceu.com      PTR  tcp  111  rpcbind   

      d   e   f        g   h  i   j  
1  open NaN NaN  syn-ack NaN  3 NaN  

But if you need to compare only the first column, hostname, use isin for the mask and ~ to invert it with boolean indexing:

df2 = A[~A['hostname'].isin(B['hostname'])]
print (df2)
        hostname                         port portname    a    b        c  \
0  82.214.228.71  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   
2  82.214.228.74  dsl-radius-02.direcpceu.com      PTR  tcp  111  rpcbind   

      d   e   f        g   h  i   j  
0  open NaN NaN  syn-ack NaN  3 NaN  
2  open NaN NaN  syn-ack NaN  3 NaN  
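
The two approaches can give different answers, as the sample shows for 82.214.228.75: it is reported by merge but not by isin. A minimal sketch of that difference, using small made-up frames (the values h1 to h3 and the ports are hypothetical):

```python
import pandas as pd

A = pd.DataFrame({'hostname': ['h1', 'h2', 'h3'], 'port': [22, 80, 443]})
B = pd.DataFrame({'hostname': ['h2', 'h3'], 'port': [80, 8443]})

# Compare on hostname only: h3 exists in B, so it is not reported,
# even though its port differs.
by_host = A[~A['hostname'].isin(B['hostname'])]

# Compare on all columns: the h3 row differs in port, so it is reported too.
merged = pd.merge(A, B, how='outer', indicator=True)
by_row = merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)

print(by_host['hostname'].tolist())
print(by_row['hostname'].tolist())
```

Which one fits depends on whether a changed port or state on a known host should count as a difference.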

11 Comments

Hey Jez, thanks! Will try this as well and get back.
Yes, sure. A small note: if the CSV also has a header row, remove the parameters header=None and names.
Perfect, Jez! Worked like a charm! I just had to add sep=';' to the writing statement, df.to_csv('DIFF_'+file, sep=';'), and I got what I wanted :). I am accepting this answer. Just one more thing, if you don't mind. I am getting the following: host hostname hostname_type protocol port \ 24 82.214.228.70 dsl-radius-01.direcpceu.com PTR tcp 111 32 82.214.228.71 dsl-radius-02.direcpceu.com PTR tcp 111
Was thinking the same :). All set! Thank you.
Change df1['_merge']=='both' to df1['_merge']!='both' to select the rows that are right-only or left-only.
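
A small sketch of the != 'both' filter from the last comment, again with made-up frames (the values are hypothetical). Keeping the _merge column shows which side each remaining row came from.

```python
import pandas as pd

A = pd.DataFrame({'hostname': ['h1', 'h2'], 'port': [22, 80]})
B = pd.DataFrame({'hostname': ['h2', 'h3'], 'port': [80, 443]})

merged = pd.merge(A, B, how='outer', indicator=True)

# Keep every row that is missing from one of the two frames.
changed = merged[merged['_merge'] != 'both']
print(changed)
```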
1

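You can add the labels where you define the DataFrame, once each row is split on ';' so that there is one value per label. For example, the following should work:

rows = [r.split(';') for r in final]  # one list of fields per row; assumes final is not empty
df = pd.DataFrame(rows, columns=["host"] + list(range(1, len(rows[0]))))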

1 Comment

Thanks Amit! Will try and get back
