
I am new to pandas; any help is appreciated.

[Image: snapshot of the dataset]

import pandas as pd

def csv_reader(fileName):
    # Load only the columns needed from the CSV export
    reqcols = ['_id__$oid', 'payload', 'channel']
    io = pd.read_csv(fileName, sep=",", usecols=reqcols)
    print(io['payload'].values)
    return io

Output row of io['payload']:

{
    "destination_ip": "172.31.14.66",
    "date": "2014-10-19T01:32:36.669861",
    "classification": "Potentially Bad Traffic",
    "proto": "UDP",
    "source_ip": "172.31.0.2",
    "priority": "2",
    "header": "1:2003195:5",
    "signature": "ET POLICY Unusual number of DNS No Such Name Responses ",
    "source_port": "53",
    "destination_port": "34638",
    "sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"
}

I am trying to extract specific fields from the payload values, which come back as an ndarray of strings. Which method can be used to extract the following fields from the DataFrame?

"destination_ip": "172.31.13.124",
"proto": "ICMP",
"source_ip": "201.158.32.1",
"date": "2014-09-28T14:49:43.391463",
"sensor": "139cfdf2-471e-11e4-9ee4-0a0b6e7c3e9e"
  • Show us a sample of your input data. Commented Apr 2, 2017 at 3:47
  • @JohnZwinck Please check the updated question Commented Apr 2, 2017 at 4:14

4 Answers


I think you first need to convert the string representation of the dicts in the payload column to dictionaries, row by row, with json.loads or ast.literal_eval. Then create a new DataFrame with the constructor, filter the columns by subset, and, if necessary, add the original columns back with concat:

d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']}
import pandas as pd

df = pd.DataFrame(d)
print(df)
  _id__$oid      channel                                            payload
0     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...
1     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...
2     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...

import json
import ast

# Parse each JSON string in the payload column into a dict
df.payload = df.payload.apply(json.loads)
# Another, slower solution:
# df.payload = df.payload.apply(ast.literal_eval)

required = ["destination_ip", "proto", "source_ip", "date", "sensor"]
df1 = pd.DataFrame(df.payload.values.tolist())[required]
print (df1)
  destination_ip proto   source_ip                        date  \
0   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   
1   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   
2   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   

                                 sensor  
0  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
1  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
2  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  

df2 = pd.concat([df[['_id__$oid','channel']], df1], axis=1)
print (df2)
  _id__$oid      channel destination_ip proto   source_ip  \
0     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2   
1     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2   
2     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2   

                         date                                sensor  
0  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
1  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
2  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
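The steps above can be collected into one function that starts from the CSV itself. This is only a sketch under assumptions: the function name expand_payload, the REQUIRED constant, and the in-memory two-row CSV text are hypothetical stand-ins for the real file, with the payload cell quoted the way CSV requires (inner double quotes doubled).

```python
import io
import json

import pandas as pd

# Hypothetical two-row CSV standing in for the real file. The payload cell
# holds a JSON string, so its inner double quotes are doubled per CSV rules.
CSV_TEXT = '''_id__$oid,channel,payload
542f8,snort_alert,"{""destination_ip"": ""172.31.14.66"", ""proto"": ""UDP"", ""source_ip"": ""172.31.0.2"", ""date"": ""2014-10-19T01:32:36.669861"", ""sensor"": ""5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"", ""source_port"": ""53""}"
542f8,snort_alert,"{""destination_ip"": ""172.31.13.124"", ""proto"": ""ICMP"", ""source_ip"": ""201.158.32.1"", ""date"": ""2014-09-28T14:49:43.391463"", ""sensor"": ""139cfdf2-471e-11e4-9ee4-0a0b6e7c3e9e""}"
'''

REQUIRED = ["destination_ip", "proto", "source_ip", "date", "sensor"]


def expand_payload(source, required=REQUIRED):
    """Read the CSV, parse the JSON payload column, keep only `required` keys."""
    df = pd.read_csv(source, usecols=["_id__$oid", "payload", "channel"])
    # Each payload cell is a str; json.loads turns it into a dict
    parsed = df["payload"].apply(json.loads)
    # One column per key; reuse df's index so concat lines the rows up
    fields = pd.DataFrame(parsed.tolist(), index=df.index)[required]
    return pd.concat([df[["_id__$oid", "channel"]], fields], axis=1)


result = expand_payload(io.StringIO(CSV_TEXT))
print(result)
```

Passing a file path instead of the StringIO object should work the same way, since pd.read_csv accepts both.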

Timings:

#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)

In [38]: %timeit pd.DataFrame(df.payload.apply(json.loads).values.tolist())[required]
1 loop, best of 3: 379 ms per loop

In [39]: %timeit pd.read_json('[{}]'.format(df.payload.str.cat(sep=',')))[required]
1 loop, best of 3: 528 ms per loop

In [40]: %timeit pd.DataFrame(df.payload.apply(ast.literal_eval).values.tolist())[required]
1 loop, best of 3: 1.98 s per loop
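
%timeit is an IPython magic, so the comparison above cannot be pasted into a plain script. A rough equivalent with the standard timeit module might look like the sketch below; the row count of 3000 and the helper names with_json and with_literal_eval are arbitrary choices for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import ast
import json
import timeit

import pandas as pd

row = ('{"destination_ip":"172.31.14.66","proto":"UDP",'
       '"source_ip":"172.31.0.2","date":"2014-10-19T01:32:36.669861",'
       '"sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}')
df = pd.DataFrame({"payload": [row] * 3000})  # row count chosen arbitrarily
required = ["destination_ip", "proto", "source_ip", "date", "sensor"]


def with_json():
    return pd.DataFrame(df.payload.apply(json.loads).tolist())[required]


def with_literal_eval():
    return pd.DataFrame(df.payload.apply(ast.literal_eval).tolist())[required]


# Both parsers should agree on this data before their speed is compared
assert with_json().equals(with_literal_eval())

# Best of 3 runs, one call per run, mirroring "best of 3" in %timeit output
timings = {
    func.__name__: min(timeit.repeat(func, number=1, repeat=3))
    for func in (with_json, with_literal_eval)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```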

Using @jezrael's sample df

d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']}
df = pd.DataFrame(d)

Solution

  • Smash all payloads together with a vectorized str.cat
  • Parse the whole thing at once with pd.read_json

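from io import StringIO

cols = 'destination_ip proto source_ip date sensor'.split()
# Wrap the comma-joined payloads in [...] so they form a single JSON array
combined = '[{}]'.format(df.payload.str.cat(sep=','))
# Newer pandas versions deprecate passing a literal JSON string, hence StringIO
df.drop(columns='payload').join(pd.read_json(StringIO(combined))[cols])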


4 Comments

Interesting. I thought your solution would be faster, but it is not.
Over 3 rows? Or more?
Check my answer.
@jezrael that's good information! I'm surprised. But I'm glad you tested it.

It is fairly straightforward to access columns in pandas. Simply pass a list of the columns you need:

Code:

columns = ["destination_ip", "proto", "source_ip", "date", "sensor"]
extracted_data = df[columns]

Test Code:

import pandas as pd

data = {
    "destination_ip": "172.31.14.66",
    "date": "2014-10-19T01:32:36.669861",
    "classification": "Potentially Bad Traffic",
    "proto": "UDP",
    "source_ip": "172.31.0.2",
    "priority": "2",
    "header": "1:2003195:5",
    "signature": "ET POLICY Unusual number of DNS No Such Name Responses ",
    "source_port": "53",
    "destination_port": "34638",
    "sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"
}
# Two identical rows are enough to show the column selection
df = pd.DataFrame([data, data])

columns = ["destination_ip", "proto", "source_ip", "date", "sensor"]
print(df[columns])

Results:

  destination_ip proto   source_ip                        date  \
0   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   
1   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   

                                 sensor  
0  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
1  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  


The issue is that payload is a column of your CSV input data, and each value is a JSON string. So you can first call read_csv(), as you have done, to parse the overall file, but then you need to parse each JSON object inside. Let's use this example data:

payload = pd.Series(['{"a":1, "b":2}', '{"b":4, "c":5}'])

Now make a single JSON string:

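json_str = '[' + ','.join(payload) + ']'

Which gives:

'[{"a":1, "b":2},{"b":4, "c":5}]'

Then parse it (wrapped in StringIO, because newer pandas versions deprecate passing a literal JSON string):

from io import StringIO
pd.read_json(StringIO(json_str))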

To get:

     a  b    c
0  1.0  2  NaN
1  NaN  4  5.0
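
As the NaN cells show, payloads do not have to share the same keys. If a wanted key might be missing from every row, plain column selection would raise a KeyError. One possible way around that, sketched here with a hypothetical key "d" and a per-row json.loads instead of read_json, is to select with reindex, which fills absent columns with NaN:

```python
import json

import pandas as pd

payload = pd.Series(['{"a":1, "b":2}', '{"b":4, "c":5}'])
wanted = ["a", "b", "d"]  # "d" is a hypothetical key absent from every payload

# Parse each JSON string into a dict, then build one column per key
parsed = pd.DataFrame(payload.apply(json.loads).tolist(), index=payload.index)

# reindex keeps the wanted columns and creates all-NaN columns for missing ones
subset = parsed.reindex(columns=wanted)
print(subset)
```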
