47

From this question and others, it seems that using concat or append to build a pandas dataframe row by row is not recommended, because each call recopies the whole dataframe.

My project involves retrieving a small amount of data every 30 seconds. This might run over a 3-day weekend, so I could easily end up with over 8000 rows created one row at a time. What would be the most efficient way to add rows to this dataframe?

7
  • 2
    If you are only adding a row every 30 seconds, does it really need to be efficient? Commented Jan 27, 2017 at 6:32
  • 4
    Is there any reason it needs to be a DataFrame? Why not just write it to a file and then convert at the end? Commented Jan 27, 2017 at 6:32
  • @Stephen Rauch Well, I was hoping to keep my samples as close to every 30 seconds as possible. Probably incorrectly, I am pulling the data, adding it to the dataframe, and then using time.sleep(30) until it's time to get the next set of data. My worry is that as the dataframe grows, the append time will start to stretch the gap between samples. From this question link it seems that at a size of 6000 it takes 2.29 seconds. I would like, if possible, to keep that number to a minimum. Commented Jan 27, 2017 at 6:43
  • 4
    If your concern is that the 30 second sleep will be inaccurate because it takes longer to append your data, then fix the sleep: next_time += 30, time.sleep(next_time - time.time()) (see the sketch after these comments). Commented Jan 27, 2017 at 6:47
  • @Stephen Rauch Oh that's a great idea! Commented Jan 27, 2017 at 6:51
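
A minimal sketch of the drift-free sleep suggested in the comments above (sample() is a hypothetical placeholder for the real data pull and append):

import time

def sample():
    # hypothetical placeholder: retrieve the data and store it somewhere
    pass

next_time = time.time()
while True:
    sample()
    next_time += 30
    # sleep only for the remainder of the interval, so processing time
    # does not push each sample later and later
    time.sleep(max(0, next_time - time.time()))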

9 Answers

84

I used this answer's df.loc[i] = [new_data] suggestion, but I have > 500,000 rows and that was very slow.

While the other answers are good for the OP's question, I found it more efficient, when dealing with large numbers of rows up front (instead of the trickle described by the OP), to use csvwriter to add the data to an in-memory CSV object and then use pandas.read_csv(csv) to generate the desired DataFrame output.

from io import StringIO  # on Python 3, csv.writer needs a text buffer (BytesIO on Python 2)
from csv import writer
import pandas as pd

output = StringIO()
csv_writer = writer(output)

# write every row into the in-memory CSV buffer
for row in iterable_object:
    csv_writer.writerow(row)

output.seek(0)  # rewind to the start of the buffer before reading it back
df = pd.read_csv(output)

For ~500,000 rows this was 1000x faster, and as the row count grows the speed improvement will only get larger (df.loc[i] = [data] gets comparatively slower and slower).


8 Comments

Could one alternatively use an in-memory structure or CSV efficiently, instead of actually writing a CSV to a file?
Great! I tested it and can confirm that this is much faster.
Note: for Python 3 you need to use StringIO instead of BytesIO
Wow, thank you SO much for this. Much easier and faster than what I was doing. It brought the computation time from ~2 hours to 2 minutes!
Why not just use output = [], output.append(row), pd.DataFrame(output)?
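
For reference, the list-based alternative the last comment asks about would look like the sketch below (iterable_object is the same iterable as in the answer above; the column names are made-up placeholders). Whether it beats the CSV-buffer route depends on the data, so no timing claim is made here.

import pandas as pd

rows = []                      # appending to a plain Python list is cheap
for row in iterable_object:
    rows.append(row)

df = pd.DataFrame(rows, columns=['col_a', 'col_b', 'col_c'])  # placeholder column names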
38

Editing the chosen answer here, since it was completely mistaken. What follows is an explanation of why you should not use setting with enlargement: it is actually worse than append.

The tl;dr here is that there is no efficient way to do this with a DataFrame, so if you need speed you should use another data structure instead. See other answers for better solutions.

More on setting with enlargement

You can add rows to a DataFrame in-place using loc on a non-existent index, but that also performs a copy of all of the data (see this discussion). Here's how it would look, from the Pandas documentation:

In [119]: dfi
Out[119]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

In [120]: dfi.loc[3] = 5

In [121]: dfi
Out[121]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

For something like the use case described, setting with enlargement actually takes 50% longer than append:

With append(), 8000 rows took 6.59s (0.8ms per row)

%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df = df.append(new_row, ignore_index=True)

# 6.59 s ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With .loc, 8000 rows took 10s (1.25ms per row)

%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(8000):
    df.loc[i] = new_row

# 10.2 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What about a longer DataFrame?

As with all profiling in data-oriented code, YMMV and you should test this for your use case. One characteristic of append and "setting with enlargement", both of which copy the entire DataFrame on every insertion, is that they get slower and slower as the DataFrame grows:

%%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
for i in range(16000):
    df.loc[i] = new_row

# 23.7 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Building a 16k-row DataFrame with this method takes 2.3x as long as building an 8k-row one.
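
For comparison, the pattern the better answers recommend, accumulating rows in a plain Python list and constructing the DataFrame once at the end, looks like this for the same A/B/C setup (a sketch only; no timings are claimed here):

import pandas as pd

rows = []
for i in range(8000):
    rows.append({"A": 4, "B": 4, "C": 4})  # plain dicts; no per-row DataFrame copy

df = pd.DataFrame(rows)  # a single construction at the end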

6 Comments

Thank you, this looks to be much better than what I was using. I appreciate the help!
Anything less hackish than assuming some index will never exist?
Of course the latter is faster. The first iteration adds a new row, and all subsequent operations write to the same row with index 3. The index has to be incremented. You'd also need df = df.append(df2) to make the comparison fair.
Also it could be a good idea to test a reindexation with the new rows (with pandas.reindex) and then copy the new data with np.array.
@waterproof: please do not edit answers. If an answer is wrong, just add another answer. You should never change the meaning of an answer with an edit.
10

Tom Harvey's solution works well. However, I would like to add a simpler solution based on pandas.DataFrame.from_dict.

By storing each row's data in a list and adding that list to a dictionary, you can then use .from_dict(dict) to create a dataframe without iteration.

If each value of the dictionary is a row, you can use just:

pd.DataFrame.from_dict(dictionary, orient='index')

Small example:

# Dictionary containing the data
dic = {
    'row_1': ['some', 'test', 'values', 78, 90],
    'row_2': ['some', 'test', 'values', 100, 589]}

# Creation of the dataframe
df = pd.DataFrame.from_dict(dic, orient='index')
df
          0     1       2    3    4
row_1  some  test  values   78   90
row_2  some  test  values  100  589
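
If you also want named columns, pandas accepts a columns argument together with orient='index' (the names below are made up for this example):

df = pd.DataFrame.from_dict(dic, orient='index',
                            columns=['w1', 'w2', 'w3', 'n1', 'n2'])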

3 Comments

It is very fast even for a large dictionary.
What do you mean by large? I need to find an alternative to enlarging/growing a pandas DataFrame with one million rows. Do you think this could be more efficient?
I use it for a dataframe of 12 million rows, and it works perfectly. Dictionaries are well suited to large datasets because the average time complexity of adding or removing an entry is O(1).
7

You need to split the problem into two parts:

  1. Accepting the data (collecting it) every 30 seconds efficiently.
  2. Processing the data once it's collected.

If your data is critical (that is, you cannot afford to lose it) - send it to a queue and then read it from the queue in batches.

The queue will provide reliable (guaranteed) acceptance and ensure that your data is not lost.

You can read the data from the queue and dump it in a database.

Now your Python app simply reads from the database and does the analysis at whatever interval makes sense for the application - perhaps you want to do hourly averages; in this case you would run your script each hour to pull the data from the db and perhaps write the results in another database / table / file.

The bottom line - split the collecting and analyzing parts of your application.
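
A minimal in-process sketch of that split, using the standard-library queue module (get_sample is a hypothetical placeholder for the real data pull; the answer's suggestion of a real message queue plus a database is the more robust version of the same idea):

import queue
import time

import pandas as pd

q = queue.Queue()  # in-process stand-in for a real message queue

def collect_forever(get_sample, interval=30):
    # producer: push each sample onto the queue immediately, then sleep off the remainder
    next_time = time.time()
    while True:
        q.put(get_sample())
        next_time += interval
        time.sleep(max(0, next_time - time.time()))

def drain_to_dataframe():
    # consumer: pull everything queued so far and build a DataFrame in one shot
    batch = []
    while not q.empty():
        batch.append(q.get())
    return pd.DataFrame(batch)

collect_forever would run in its own thread or process; drain_to_dataframe can then be called on whatever analysis schedule makes sense (hourly, at the end of the weekend, and so on).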

1 Comment

This is a great idea! Probably a bit outside of my skill level at the moment but this just gives me so many good ideas!!! I think after I get it up and running I will try and make something like this happen. Thank you!
2

Assuming that your dataframe is indexed in order, you can:

First check to see what the next index value is to create a new row:

myindex = df.shape[0]+1 

Then use .at to write to each desired column:

df.at[myindex,'A']=val1
df.at[myindex,'B']=val2
df.at[myindex,'C']=val3
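
A sketch of this approach inside the 30-second collection loop (samples is a hypothetical placeholder for the real data source; note that with pandas' default 0-based RangeIndex the next free label is df.shape[0], while the +1 above suits an index that starts at 1):

import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C'])

for val1, val2, val3 in samples:  # placeholder iterable of incoming readings
    myindex = df.shape[0]         # next label for a default 0-based RangeIndex
    df.at[myindex, 'A'] = val1
    df.at[myindex, 'B'] = val2
    df.at[myindex, 'C'] = val3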

Comments

2

I had 700K rows of data returned from SQL Server. All of the above took too long for me; the following approach cut the time significantly.

from collections import defaultdict
import pandas as pd

dict1 = defaultdict(list)

# build one list per column, appending a value from every result row
for row in results:
    dict1['column_name1'].append(row['column_name1'])
    # ... repeat for each remaining column ...
    dict1['column_name20'].append(row['column_name20'])

df = pd.DataFrame(dict1)

This is all I needed.

Comments

2

sundance's answer might be correct in terms of usage, but the benchmark is just wrong. As correctly pointed out by moobie, an index 3 already exists in this example, which makes access way quicker than with a non-existent index. Have a look at this:

%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    testrow = pd.DataFrame([0,0,0])
    pd.concat([test[:1], testrow, test[1:]])

2.15 s ± 88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    test2 = pd.DataFrame({'A': 0, 'B': 0, 'C': 0}, index=[i+0.5])
    test.append(test2, ignore_index=False)
test.sort_index().reset_index(drop=True)

972 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
test = pd.DataFrame({"A": [1,2,3], "B": [1,2,3], "C": [1,2,3]})
for i in range(0,1000):
    test3 = [0,0,0]
    test.loc[i+0.5] = test3
test.reset_index(drop=True)

1.13 s ± 46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Of course, this is purely synthetic, and I admittedly wasn't expecting these results, but it seems that with non-existent indices .loc and .append perform quite similarly. Just leaving this here.

Comments

0

My coworker told me to make a list of dictionary entries, then push the finished list into a dataframe. Compared to pushing one dictionary at a time into a dataframe, the list approach was instantaneous.

This code culls through ~54k records, keeps only those after my targ_datetime value, writes the desired values to a list, and then builds df_out from that list:

# 'df' here appears to be a Spark DataFrame (selectExpr / take); df_out is pandas
df_out = pd.DataFrame()
df_len = df.count()
counter = 1
list_out = []
targ_datetime = datetime.datetime.fromisoformat('2021-12-30 00:00:00')
for rec in df.selectExpr("CAST(data as STRING) as data").take(df_len):
  j = jsonx.loads(rec[0])
  NewImage = j['dynamodb']['NewImage']
  NewImage['eventName'] = j['eventName']
  if j.get('dynamodb').get('NewImage').get('UPDATED_AT') != None:
    ts = datetime.datetime.fromisoformat(str(j['dynamodb']['NewImage']['UPDATED_AT']).replace('T', ' ')[0:-5])
  else:
    ts = datetime.datetime.fromtimestamp(j['dynamodb']['ApproximateCreationDateTime']/1000)
  if ts >= targ_datetime:
    #df_out = df_out.append(pd.Series(NewImage.values(), index=NewImage.keys()), ignore_index=True)  # the old per-row append this replaces
    j['dynamodb']['NewImage']['UPDATED_AT'] = ts
    list_out.append(NewImage)
    counter = counter + 1
  #if counter > 10: break
# build the DataFrame once from the accumulated list
df_out = pd.DataFrame(list_out)

Comments

0

Strangely, no one has suggested a practical piece of code. As we have seen, pandas makes a copy, rearranges memory, and checks index uniqueness each time a row is added, all of which is very slow.

So we will add our data to native Python structures (lists and dicts) and turn it into a dataframe only as the last step.

df = pd.DataFrame(
    data = {
        'My primary column': [],
        'Another columns': [],
        'One more column': [],
    },
    index = None # even if you need it indexed, we can set it later
)
data_add = [] # we'll be adding records here, it's fast;
# then we'll just append it to a dataframe in one op, it's fast

for record in read_data_source:
    # ... build a row with a value for every column; here it is a list
    row_add_list = [ col1, col2, ... ]
    # convert it to a dict keyed by the dataframe's column names
    row_add = { col[0]: col[1] for col in zip(df.columns, row_add_list) }
    data_add.append(row_add)

# now we convert it back to pandas dataframe object
df = pd.concat(
    [
        df,
        pd.DataFrame(data_add)
    ],
    ignore_index = True, # it's fast to ignore index, so we just append
)
# as mentioned above, if we need an index we can set it here, in one operation,
# so it does not slow down the addition of every row
df.set_index('My primary column', inplace=True)

Now the only wait is the couple of minutes it takes to save the file (probably because my file is quite complex, with formulas and formatting); constructing the data is now fast.

Comments
