I have:

  • NewData, a pd.DataFrame to be populated from
  • SourceData, a list of dataframes holding source data and
  • source, a dataframe holding index values for which dataframe in SourceData is to be assigned.
  • indexlen, an integer for the length of indexes in the dataframes

(Using dataframes because it's critical the indexes align.)

For instance, assume that there are 1000 DataFrames in SourceData and indexlen is 10,000. Starting at 10,000, I assign all columns from the current DataFrame in SourceData to NewData, moving up the index (all the DataFrames share the same index) until source decrements. At that point I start assigning the values from all columns of SourceData[999] to NewData, and so on.

I'm currently doing this with a loop:

for j in range(1, indexlen + 1):
    NewData[j] = SourceData[source[j]].ix[j,:]

I would like to do this without using a loop, but I don't know how to broadcast it. I'm sure I'm missing something obvious, but any help would be appreciated. Thank you!

Edit: I made source a list, because I figured that was more efficient to access by element.

In response to a question about the dataframes, they are standard price data:

>>> SourceData[1].head()

bpz1975     Open    High    Low     Close   Vol     OI
1975-02-13  2.275   2.275   2.275   2.275   0       50
1975-02-14  2.275   2.275   2.275   2.275   0       50
1975-02-18  2.275   2.275   2.275   2.275   0       50
1975-02-19  2.290   2.290   2.290   2.290   0       50
1975-02-20  2.290   2.290   2.290   2.290   0       50

In this case, I am reading in all months of a futures contract and then applying roll logic to create a series.

  • Do you have some samples for what your data frames look like? Commented Feb 14, 2014 at 1:45
  • Edited the question with a head() of one of the dfs. Also, the indexes can well be >10,000, so memory may be an issue too if I don't do this efficiently. (As I think you can tell, my question is as much about good programming practice as this specific problem, so any criticisms are welcome. Thanks!) Commented Feb 14, 2014 at 1:55
  • I also just tried it with NewData as a list. Much, much faster. That solution is acceptable, I think, if there's no better way to do it. Commented Feb 14, 2014 at 2:03

1 Answer

Creating the DataFrame and then filling it in is usually not the fastest or most pandastic way.

In this case it looks like you can do a concat:

pd.concat(SourceData)

If you need to include source, the index information, within the DataFrames in SourceData, then I would do this before doing the concat.

It's unclear exactly what this entails, but it sounds like you're suggesting setting the index of each frame based on source. You could write a function that passes over SourceData, replacing the index of each DataFrame with the one from source (without seeing source, it's unclear exactly how).
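As a rough sketch of one way this could look, assuming source holds, for each row of the shared index, the integer position of the frame in SourceData to take that row from: concat the frames with keys so that each row is labelled by (frame position, date), then select all the wanted pairs in one call. The toy frames, the column names and the values in source below are made up for illustration; they are not your data.

```python
import pandas as pd

# Hypothetical toy data: three "contracts" that share one date index.
dates = pd.date_range("1975-02-13", periods=4, freq="D")
source_data = [
    pd.DataFrame({"Open": [float(k)] * 4, "Close": [k + 0.5] * 4}, index=dates)
    for k in range(3)
]

# Assumption: source[i] is the position in source_data that supplies row i.
source = [2, 2, 1, 0]

# Stack the frames; `keys` adds an outer index level holding each frame's position.
stacked = pd.concat(source_data, keys=list(range(len(source_data))))

# Build one (frame position, date) label per output row and select them all at once.
wanted = pd.MultiIndex.from_arrays([source, dates])
new_data = stacked.reindex(wanted).droplevel(0)

print(new_data)
```

Note that reindex returns NaN rows for any (frame, date) pair that does not exist instead of raising, so it may be worth checking the result for missing values if the frames do not all cover the same dates.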
