
I'm trying to plot performance metrics for various assets in a backtest.

I have imported 'test_predictions.json' into a pandas DataFrame. It is a list of dictionaries and contains results for various assets (listed one after the other). Here is a sample of the data:

trading_pair return  timestamp prediction
[u'Poloniex_ETH_BTC' 0.003013302628677 1450753200L -0.157053292753482]
[u'Poloniex_ETH_BTC' 0.006013302628677 1450753206L -0.187053292753482]
...
[u'Poloniex_FCT_BTC' 0.006013302628677 1450753100L 0.257053292753482] 

Each backtest starts and ends at different times.

Here is the data for the assets of interest:

'''
#These are the assets I would like to analyse
Poloniex_DOGE_BTC 2015-10-21 02:00:00 1445392800
Poloniex_DOGE_BTC 2016-01-12 05:00:00 1452574800

Poloniex_XRP_BTC 2015-10-28 06:00:00 1446012000
Poloniex_XRP_BTC 2016-01-12 05:00:00 1452574800

Poloniex_XMR_BTC 2015-10-21 14:00:00 1445436000
Poloniex_XMR_BTC 2016-01-12 06:00:00 1452578400

Poloniex_VRC_BTC 2015-10-25 07:00:00 1445756400
Poloniex_VRC_BTC 2016-01-12 00:00:00 1452556800

'''

So I'm trying to make a new array that contains the data for these assets. Each asset must be sliced appropriately so they all start at the latest start time and end at the earliest end time (otherwise there will be incomplete data).

#each array should start and end:  
#start 2015-10-28 06:00:00
#end 2016-01-12 00:00:00

So the question is:

How can I search for an asset, i.e. Poloniex_DOGE_BTC, then acquire the indices for the start and end times specified above?

I will be plotting the data via numpy, so maybe it's better to turn it into a numpy array with df.values and then conduct the search? Then I could use np.hstack((df_index_asset1, df_index_asset2)) so it's in the right form to plot. So the problem is: using either pandas or numpy, how do I retrieve the data for the specified assets which fall within the master start and end times?

On a side note, here is the code I wrote to get the start and end dates; it's not the most efficient, so improving that would be a bonus.

EDIT:

From Kartik's answer I tried to obtain just the data for asset name 'Poloniex_DOGE_BTC' using the following code:

import pandas as pd
import numpy as np

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby(asset)

print grouped

But it throws an error.

EDIT2: I have changed the link to the data so it is test_predictions.json

EDIT3: this worked a treat:

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby('market_trading_pair')
print grouped.get_group(asset)

#each array should start and end: 
#start 2015-10-28 06:00:00 1446012000
#end 2016-01-12 00:00:00 1452556800 

Now how can we truncate the data so that it starts and ends at the above timestamps?

  • I think you shared the wrong file - backtest.txt instead of back_test.json. Commented Jan 28, 2016 at 6:36
  • Sorry, but "It is a list of dictionaries" is wrong. It is not valid JSON, so pd.read_json doesn't work. Commented Jan 28, 2016 at 7:12
  • Ah ok, this is interesting. The file I'm reading in is called test_predictions.json and I think it is JSON; furthermore, using my current code I was able to look at data in the df. Commented Jan 28, 2016 at 8:06
  • Hmmm, it is much better. Commented Jan 28, 2016 at 8:13
  • You should take a minute to read the Pandas groupby docs: pandas.pydata.org/pandas-docs/stable/groupby.html Commented Jan 28, 2016 at 8:18

1 Answer

Firstly, why like this?

data = pd.read_json(preds).values
df = pd.DataFrame(data)

You can just write that as:

df = pd.read_json(preds)

And if you want a NumPy array from df then you can execute data = df.values later.

And it should put the data in a DataFrame. (Unless I am much mistaken, because I have never used read_json() before.)

The second thing is getting the data for each asset out. For that, I am assuming you need to process all assets. To do that, you can simply do:

# To convert it to datetime.
# This is not important, and you can skip it if you want, because epoch times in
# seconds will perfectly work with the rest of the method.
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# This will give you a group for each asset on which you can apply some function.
# We will apply min and max to get the desired output.
grouped = df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()

Now start_times and end_times will be Series. Their index will be your asset names, and the values will be the minimum and maximum times respectively.

I think this is the answer you are looking for, from my understanding of your question. Please let me know if that is not the case.
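From those two Series you can also derive the common window you described: the master start is the latest per-asset start, and the master end is the earliest per-asset end. A minimal sketch with made-up numbers (column names assumed to match yours):

```python
import pandas as pd

# Toy frame standing in for the backtest data
df = pd.DataFrame({
    'trading_pair': ['A', 'A', 'A', 'B', 'B', 'B'],
    'timestamp':    [10, 20, 30, 15, 25, 40],
})

grouped = df.groupby('trading_pair')
start_times = grouped['timestamp'].min()  # per-asset earliest timestamp
end_times = grouped['timestamp'].max()    # per-asset latest timestamp

# The common window shared by all assets:
master_start = start_times.max()  # latest start across assets -> 15
master_end = end_times.min()      # earliest end across assets -> 30
```

With your real data those two numbers would be the epoch seconds for 2015-10-28 06:00:00 and 2016-01-12 00:00:00.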

EDIT

If you are specifically looking for a few (one or two or ten) assets, you can modify the above code like so:

asset = ['list', 'of', 'required', 'assets'] # Even one element is fine.
req_df = df[df['trading_pair'].isin(asset)]

grouped = req_df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()
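As a quick check with a toy frame (column names assumed), isin keeps only the rows whose asset name appears in the list:

```python
import pandas as pd

df = pd.DataFrame({
    'trading_pair': ['A', 'B', 'C'],
    'timestamp': [1, 2, 3],
})

asset = ['A', 'C']  # even one element is fine
req_df = df[df['trading_pair'].isin(asset)]
print(req_df['trading_pair'].tolist())  # -> ['A', 'C']
```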

EDIT2: this worked a treat:

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby('market_trading_pair')
print grouped.get_group(asset)

#each array should start and end: 
#start 2015-10-28 06:00:00 1446012000
#end 2016-01-12 00:00:00 1452556800 

Now how can we truncate the data so that it starts and ends at the above timestamps?
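One way to do that is a boolean mask between the master start and end. A sketch, assuming the 'timestamp' column still holds epoch seconds (the column names here mirror your EDIT3):

```python
import pandas as pd

# Made-up rows for one asset; two fall inside the common window, two outside
df = pd.DataFrame({
    'market_trading_pair': ['Poloniex_DOGE_BTC'] * 4,
    'timestamp': [1445392800, 1446012000, 1452556800, 1452574800],
})

master_start = 1446012000  # 2015-10-28 06:00:00
master_end = 1452556800    # 2016-01-12 00:00:00

# Keep only rows inside the common window
mask = (df['timestamp'] >= master_start) & (df['timestamp'] <= master_end)
truncated = df[mask]
```

DataFrame.truncate(before, after) would do the same after setting a sorted timestamp index, but a plain boolean mask avoids the reindexing step.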


As an aside, plotting datetimes from Pandas is very convenient as well. I use it all the time to produce most of the plots I create. And all of my data is timestamped.


8 Comments

Hey Kartik, thanks a lot for the reply. This is exactly what I needed. However, when I try the groupby function, it throws an error. Could you look at my EDIT in the main post? Thanks.
Yes obviously it does! Groupby takes DataFrame column(s) to group on. 'trading_pair' is the name of your column with asset names, correct? If you just want a single group output, you would have to execute: grouped = df.groupby('trading_pair'); grouped.get_group(asset)...
Ok great we're almost there, i have posted another EDIT :)
Ah I see, we can use DataFrame.truncate(min, max) maybe? I'm just experimenting with it now
I do not see any new edits. Anyway, I'm off to sleep. I'll help you tomorrow.