extract specific data from numpy ndarray

Question

I am new to pandas; any help is appreciated.

import pandas as pd

def csv_reader(fileName):
    # Read only the three columns of interest from the CSV file
    reqcols = ['_id__$oid', 'payload', 'channel']
    io = pd.read_csv(fileName, sep=",", usecols=reqcols)
    print(io['payload'].values)
    return io

Output row of io['payload']:

{
    "destination_ip": "172.31.14.66",
    "date": "2014-10-19T01:32:36.669861",
    "classification": "Potentially Bad Traffic",
    "proto": "UDP",
    "source_ip": "172.31.0.2",
    "priority": "`2",
    "header": "1:2003195:5",
    "signature": "ET POLICY Unusual number of DNS No Such Name Responses ",
    "source_port": "53",
    "destination_port": "34638",
    "sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"
}

I am trying to extract specific fields from the ndarray returned by io['payload'].values. Which method can I use to extract only the following fields from the DataFrame?

"destination_ip": "172.31.13.124",
"proto": "ICMP",
"source_ip": "201.158.32.1",
"date": "2014-09-28T14:49:43.391463",
"sensor": "139cfdf2-471e-11e4-9ee4-0a0b6e7c3e9e"

Show us a sample of your input data.

– John Zwinck, Commented Apr 2, 2017 at 3:47
@JohnZwinck Please check the updated question

– user1208523, Commented Apr 2, 2017 at 4:14

jezrael · Accepted Answer · 2017-04-02 05:25:44Z

I think you first need to convert the string representation of the dicts in the payload column to dictionaries, row by row, with json.loads or ast.literal_eval. Then create a new DataFrame with the constructor, filter the columns by subset, and if necessary add the original columns back with concat:

d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']}
import pandas as pd

df = pd.DataFrame(d)
print(df)
  _id__$oid      channel                                            payload
0     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...
1     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...
2     542f8  snort_alert  {"destination_ip":"172.31.14.66","date": "2014...

import json
import ast
df.payload = df.payload.apply(json.loads)
# alternative, slower solution
# df.payload = df.payload.apply(ast.literal_eval)

required = ["destination_ip", "proto", "source_ip", "date", "sensor"]
df1 = pd.DataFrame(df.payload.values.tolist())[required]
print (df1)
  destination_ip proto   source_ip                        date  \
0   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   
1   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   
2   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   

                                 sensor  
0  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
1  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
2  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  

df2 = pd.concat([df[['_id__$oid','channel']], df1], axis=1)
print (df2)
  _id__$oid      channel destination_ip proto   source_ip  \
0     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2   
1     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2   
2     542f8  snort_alert   172.31.14.66   UDP  172.31.0.2   

                         date                                sensor  
0  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
1  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
2  2014-10-19T01:32:36.669861  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e
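
The steps above can be gathered into one helper that works on a frame like the one returned by the question's csv_reader. This is only a minimal sketch, not part of the original answer: the name extract_payload_fields and its parameters are hypothetical, the two sample rows are made up for illustration, and it assumes every payload value is a valid JSON object string that contains all requested keys.

```python
import json

import pandas as pd


def extract_payload_fields(df, fields, payload_col="payload"):
    """Expand selected keys of a JSON-string column into real columns.

    Hypothetical helper: assumes each value in `payload_col` is a valid
    JSON object string that contains every key listed in `fields`.
    """
    # Parse each JSON string into a dict.
    parsed = df[payload_col].apply(json.loads)
    # Build a frame from the dicts, reusing the original index so that
    # the concat below aligns row by row, then keep only the wanted keys.
    expanded = pd.DataFrame(parsed.tolist(), index=df.index)[fields]
    # Put the remaining original columns next to the extracted ones.
    return pd.concat([df.drop(columns=payload_col), expanded], axis=1)


# Made-up sample rows, shaped like the data in the question.
sample = pd.DataFrame({
    "_id__$oid": ["542f8", "542f9"],
    "channel": ["snort_alert", "snort_alert"],
    "payload": [
        '{"proto": "UDP", "source_ip": "172.31.0.2", "priority": "2"}',
        '{"proto": "ICMP", "source_ip": "201.158.32.1", "priority": "3"}',
    ],
})
result = extract_payload_fields(sample, ["proto", "source_ip"])
print(result)
```

Passing index=df.index matters when the input frame does not have a default 0..n-1 index (for example after filtering); without it, concat would align on mismatched labels.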

Timings:

#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)

In [38]: %timeit pd.DataFrame(df.payload.apply(json.loads).values.tolist())[required]
1 loop, best of 3: 379 ms per loop

In [39]: %timeit pd.read_json('[{}]'.format(df.payload.str.cat(sep=',')))[required]
1 loop, best of 3: 528 ms per loop

In [40]: %timeit pd.DataFrame(df.payload.apply(ast.literal_eval).values.tolist())[required]
1 loop, best of 3: 1.98 s per loop
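
The %timeit lines above only run inside IPython. A rough way to repeat the comparison in a plain script is the standard-library timeit module. The sketch below makes its own assumptions: the function names, the shortened payload, and the row count of 3000 are arbitrary choices for illustration, and absolute numbers will differ by machine and pandas version.

```python
import ast
import json
import timeit

import pandas as pd

payload = (
    '{"proto": "UDP", "source_ip": "172.31.0.2", '
    '"sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}'
)
# 3000 identical rows; the row count is an arbitrary choice for this sketch.
df = pd.DataFrame({"payload": [payload] * 3000})
required = ["proto", "source_ip"]


def with_json_loads():
    return pd.DataFrame(df.payload.apply(json.loads).tolist())[required]


def with_literal_eval():
    return pd.DataFrame(df.payload.apply(ast.literal_eval).tolist())[required]


# Best of 3 single-call runs, similar in spirit to "best of 3" in %timeit.
t_json = min(timeit.repeat(with_json_loads, number=1, repeat=3))
t_eval = min(timeit.repeat(with_literal_eval, number=1, repeat=3))
print(f"json.loads:       {t_json:.4f} s")
print(f"ast.literal_eval: {t_eval:.4f} s")
```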

piRSquared · Accepted Answer · 2017-04-02 05:17:28Z


Using @jezrael's sample df

d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']}
df = pd.DataFrame(d)

solution

Smash all payloads together with a vectorized str.cat
Parse the whole thing at once with pd.read_json

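from io import StringIO

cols = 'destination_ip proto source_ip date sensor'.split()
# Join all payload strings into one JSON array and parse it in a single call.
# StringIO is used because newer pandas versions deprecate passing literal JSON to read_json.
parsed = pd.read_json(StringIO('[{}]'.format(df.payload.str.cat(sep=','))))[cols]
# drop(columns=...) replaces the positional axis argument removed in pandas 2.0.
df.drop(columns='payload').join(parsed)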


4 Comments

jezrael Over a year ago

Interesting, I thought your solution would be faster, but it is not.

piRSquared Over a year ago

over 3 rows? Or more?

jezrael Over a year ago

check my answer.

piRSquared Over a year ago

@jezrael that's good information! I'm surprised. But I'm glad you tested it.

Stephen Rauch · Accepted Answer · 2017-04-02 03:51:11Z

It is fairly straightforward to access columns in pandas. Simply pass a list of the columns you need:

Code:

columns = ["destination_ip", "proto", "source_ip", "date", "sensor"]
extracted_data = df[columns]

Test Code:

import pandas as pd

data = {
    "destination_ip": "172.31.14.66",
    "date": "2014-10-19T01:32:36.669861",
    "classification": "Potentially Bad Traffic",
    "proto": "UDP",
    "source_ip": "172.31.0.2",
    "priority": "`2",
    "header": "1:2003195:5",
    "signature": "ET POLICY Unusual number of DNS No Such Name Responses ",
    "source_port": "53",
    "destination_port": "34638",
    "sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"
}
df = pd.DataFrame([data, data])

columns = ["destination_ip", "proto", "source_ip", "date", "sensor"]
print(df[columns])

Results:

  destination_ip proto   source_ip                        date  \
0   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   
1   172.31.14.66   UDP  172.31.0.2  2014-10-19T01:32:36.669861   

                                 sensor  
0  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e  
1  5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e

John Zwinck · Accepted Answer · 2017-04-02 04:29:43Z


The issue is that payload is a column of your CSV input data, and each of its values is a JSON string. So you can first call read_csv() as you have done to parse the overall file, but then you need to parse each JSON object inside. Let's use this example data:

payload = pd.Series(['{"a":1, "b":2}', '{"b":4, "c":5}'])

Now make a single JSON string:

json = ','.join(payload).join('[]')

Which gives:

'[{"a":1, "b":2},{"b":4, "c":5}]'

Then parse it:

from io import StringIO
pd.read_json(StringIO(json))  # newer pandas versions deprecate passing a literal JSON string

To get:

     a  b    c
0  1.0  2  NaN
1  NaN  4  5.0
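
As the output shows, a key that is missing from one payload becomes NaN in that row. A related case is worth noting when selecting specific fields: if a requested key appears in no payload at all, plain column selection raises KeyError. The sketch below shows one possible way around that with DataFrame.reindex. It is an illustration under assumptions, not part of the original answer: the names parsed, wanted, and subset are hypothetical, and it parses each row with json.loads rather than read_json.

```python
import json

import pandas as pd

payload = pd.Series(['{"a":1, "b":2}', '{"b":4, "c":5}'])
# Parse every JSON string into a dict and build the frame from the dicts.
parsed = pd.DataFrame([json.loads(s) for s in payload])

# "d" appears in no payload: parsed[["b", "d"]] would raise KeyError,
# while reindex creates the column and fills it with NaN.
wanted = ["b", "d"]
subset = parsed.reindex(columns=wanted)
print(subset)
```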

