
I'm trying to plot performance metrics for various assets in a backtest.

I have imported 'test_predictions.json' into a pandas DataFrame. It is a list of dictionaries and contains results for various assets (listed one after the other). Here is a sample of the data:

trading_pair return  timestamp prediction
[u'Poloniex_ETH_BTC' 0.003013302628677 1450753200L -0.157053292753482]
[u'Poloniex_ETH_BTC' 0.006013302628677 1450753206L -0.187053292753482]
...
[u'Poloniex_FCT_BTC' 0.006013302628677 1450753100L 0.257053292753482] 

Each backtest starts and ends at different times.

Here is the data for the assets of interest:

'''
#These are the assets I would like to analyse
Poloniex_DOGE_BTC 2015-10-21 02:00:00 1445392800
Poloniex_DOGE_BTC 2016-01-12 05:00:00 1452574800

Poloniex_XRP_BTC 2015-10-28 06:00:00 1446012000
Poloniex_XRP_BTC 2016-01-12 05:00:00 1452574800

Poloniex_XMR_BTC 2015-10-21 14:00:00 1445436000
Poloniex_XMR_BTC 2016-01-12 06:00:00 1452578400

Poloniex_VRC_BTC 2015-10-25 07:00:00 1445756400
Poloniex_VRC_BTC 2016-01-12 00:00:00 1452556800

'''

So I'm trying to make a new array that contains the data for these assets. Each asset must be sliced appropriately so they all start at the latest start time and end at the earliest end time (otherwise there will be incomplete data).

#each array should start and end:  
#start 2015-10-28 06:00:00
#end 2016-01-12 00:00:00

So the question is:

How can I search for an asset, i.e. Poloniex_DOGE_BTC, then acquire the indices for the start and end times specified above?

I will be plotting the data via numpy, so maybe it's better to turn it into a numpy array with df.values and then conduct the search? Then I could use np.hstack((df_index_asset1, df_index_asset2)) so it's in the right form to plot. So the problem is: using either pandas or numpy, how do I retrieve the data for the specified assets which fall within the master start and end times?

On a side note, here is the code I wrote to get the start and end dates; it's not the most efficient, so improving that would be a bonus.

EDIT:

From Kartik's answer I tried to obtain just the data for asset name 'Poloniex_DOGE_BTC' using the following code:

import pandas as pd
import numpy as np

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby(asset)

print grouped

But it throws an error.

EDIT2: I have changed the link to the data so it is test_predictions.json

EDIT3: this worked a treat:

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby('market_trading_pair')
print grouped.get_group(asset)

#each array should start and end: 
#start 2015-10-28 06:00:00 1446012000
#end 2016-01-12 00:00:00 1452556800 

Now how can we truncate the data so that it starts and ends at the above timestamps?

  • I think you shared the wrong file - backtest.txt instead of back_test.json. Commented Jan 28, 2016 at 6:36
  • Sorry, but "It is a list of dictionaries" is wrong. It is not valid JSON, so pd.read_json doesn't work. Commented Jan 28, 2016 at 7:12
  • Ah ok, this is interesting. The file I'm reading in is called test_predictions.json and I think it is JSON; furthermore, using my current code I was able to look at data in the df. Commented Jan 28, 2016 at 8:06
  • Hmmm, it is much better. Commented Jan 28, 2016 at 8:13
  • You should take a minute to read the Pandas groupby docs: pandas.pydata.org/pandas-docs/stable/groupby.html Commented Jan 28, 2016 at 8:18

1 Answer

Firstly, why like this?

data = pd.read_json(preds).values
df = pd.DataFrame(data)

You can just write that as:

df = pd.read_json(preds)

And if you want a NumPy array from df then you can execute data = df.values later.

And it should put the data in a DataFrame. (Unless I am much mistaken, because I have never used read_json() before.)

The second thing is getting the data for each asset out. For that, I am assuming you need to process all assets. To do that, you can simply do:

# To convert it to datetime.
# This is not important, and you can skip it if you want, because epoch times in
# seconds will perfectly work with the rest of the method.
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# This will give you a group for each asset on which you can apply some function.
# We will apply min and max to get the desired output.
grouped = df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()

Now start_times and end_times will be Series. Their index will be your asset names, and the values will be the minimum and maximum times respectively.

I think this is the answer you are looking for, from my understanding of your question. Please let me know if that is not the case.
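From those two Series you can also derive the common window you described: the master start is the latest per-asset start, and the master end is the earliest per-asset end. A minimal sketch with made-up numbers (column names assumed to match yours):

```python
import pandas as pd

# Toy frame standing in for the backtest data
df = pd.DataFrame({
    'trading_pair': ['A', 'A', 'A', 'B', 'B', 'B'],
    'timestamp':    [10, 20, 30, 15, 25, 40],
})

grouped = df.groupby('trading_pair')
start_times = grouped['timestamp'].min()  # per-asset earliest timestamp
end_times = grouped['timestamp'].max()    # per-asset latest timestamp

# The common window shared by all assets:
master_start = start_times.max()  # latest start across assets -> 15
master_end = end_times.min()      # earliest end across assets -> 30
```

With your real data those two numbers would be the epoch seconds for 2015-10-28 06:00:00 and 2016-01-12 00:00:00.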

EDIT

If you are specifically looking for a few (one or two or ten) assets, you can modify the above code like so:

asset = ['list', 'of', 'required', 'assets'] # Even one element is fine.
req_df = df[df['trading_pair'].isin(asset)]

grouped = req_df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()
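As a quick check with a toy frame (column names assumed), isin keeps only the rows whose asset name appears in the list:

```python
import pandas as pd

df = pd.DataFrame({
    'trading_pair': ['A', 'B', 'C'],
    'timestamp': [1, 2, 3],
})

asset = ['A', 'C']  # even one element is fine
req_df = df[df['trading_pair'].isin(asset)]
print(req_df['trading_pair'].tolist())  # -> ['A', 'C']
```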

EDIT2: this worked a treat:

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby('market_trading_pair')
print grouped.get_group(asset)

#each array should start and end: 
#start 2015-10-28 06:00:00 1446012000
#end 2016-01-12 00:00:00 1452556800 

Now how can we truncate the data so that it starts and ends at the above timestamps?
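One way to do that is a boolean mask between the master start and end. A sketch, assuming the 'timestamp' column still holds epoch seconds (the column names here mirror your EDIT3):

```python
import pandas as pd

# Made-up rows for one asset; two fall inside the common window, two outside
df = pd.DataFrame({
    'market_trading_pair': ['Poloniex_DOGE_BTC'] * 4,
    'timestamp': [1445392800, 1446012000, 1452556800, 1452574800],
})

master_start = 1446012000  # 2015-10-28 06:00:00
master_end = 1452556800    # 2016-01-12 00:00:00

# Keep only rows inside the common window
mask = (df['timestamp'] >= master_start) & (df['timestamp'] <= master_end)
truncated = df[mask]
```

DataFrame.truncate(before, after) would do the same after setting a sorted timestamp index, but a plain boolean mask avoids the reindexing step.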


As an aside, plotting datetimes from Pandas is very convenient as well. I use it all the time to produce most of the plots I create. And all of my data is timestamped.


8 Comments

Hey Kartik, thanks a lot for the reply. This is exactly what I needed. However, when I try the groupby function, it throws an error. Could you look at my EDIT in the main post? Thanks.
Yes obviously it does! Groupby takes DataFrame column(s) to group on. 'trading_pair' is the name of your column with asset names, correct? If you just want a single group output, you would have to execute: grouped = df.groupby('trading_pair'); grouped.get_group(asset)...
Ok great we're almost there, i have posted another EDIT :)
Ah I see, we can use DataFrame.truncate(min, max) maybe? I'm just experimenting with it now
I do not see any new edits. Anyway, I'm off to sleep. I'll help you tomorrow.