
My objective is to call an API for each row in a Pandas DataFrame. Each response JSON contains a list of strings, and I want to create a new DataFrame with one row per string in the response. My code basically looks like this:

import json

import pandas
import requests

i = 0
new_df = pandas.DataFrame(columns=['a', 'b', 'c', 'd'])
for index, row in df.iterrows():
    url = 'http://myAPI/'
    # Assumes row['data'] holds a JSON fragment to embed in the payload
    d = '{"SomeJSONData": ' + row['data'] + '}'
    j = json.loads(d)
    response = requests.post(url, json=j)

    data = response.json()
    # One new row per element of the 'c' list in the response
    for new_data in data['c']:
        new_df.loc[i] = [row['a'], row['b'], row['c'], new_data]
        i += 1

This works fine, but I'm making about 5,500 API calls and writing about 6,500 rows to the new DataFrame, so it takes a while, maybe 10 minutes. Does anyone know of a way to speed this up? I'm not too familiar with running parallel for loops in Python; could this be done while maintaining thread safety?
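For reference, a minimal sketch of one thread-pool approach (concurrent.futures is in the standard library; the payload format is assumed from the code above). Each worker builds and returns its own rows, and the new DataFrame is assembled once at the end, so no shared state is mutated across threads:

import json
from concurrent.futures import ThreadPoolExecutor

import pandas
import requests

url = 'http://myAPI/'  # placeholder endpoint from the question

def fetch_rows(row):
    # Assumes row['data'] holds a JSON fragment, as in the loop above
    payload = json.loads('{"SomeJSONData": ' + row['data'] + '}')
    data = requests.post(url, json=payload).json()
    # One output row per element of the 'c' list in the response
    return [[row['a'], row['b'], row['c'], new_data] for new_data in data['c']]

with ThreadPoolExecutor(max_workers=20) as pool:
    results = pool.map(fetch_rows, (row for _, row in df.iterrows()))

new_df = pandas.DataFrame(
    [r for rows in results for r in rows],
    columns=['a', 'b', 'c', 'd'],
)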

  • Since your data-frame has several dtype=object columns, just using iterrows is going to be about as fast as you can get. You can try threading, since requests are I/O-bound; in which case, look at the examples here. Do note, that question was originally posed 7 years ago, so look at the more recent examples. Commented Oct 17, 2017 at 21:12
  • I would use requests-futures outside of pandas to make async requests, get the results, and rebuild the column after that (see the sketch after these comments). Commented Oct 17, 2017 at 21:13
  • It's not the looping that is an issue; it's the API calls. I have no idea if the web API can handle a bulk query. I'm no expert at what @roganjosh suggested, but that sounds like a good idea. Commented Oct 17, 2017 at 21:13
  • Also, you should consider not using requests, because it is synchronous by design. Check out some options here. Commented Oct 17, 2017 at 21:16
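As a concrete illustration of the requests-futures suggestion above (a sketch using the question's placeholder endpoint and payload format; not code from the thread):

import json
from requests_futures.sessions import FuturesSession

url = 'http://myAPI/'
session = FuturesSession(max_workers=20)

# Fire off all the POSTs without blocking, then collect the responses
futures = [
    session.post(url, json=json.loads('{"SomeJSONData": ' + row['data'] + '}'))
    for _, row in df.iterrows()
]
responses = [future.result().json() for future in futures]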

1 Answer


Something along these lines, perhaps? This way you aren't creating a whole new DataFrame, you declare the URL only once, and you take advantage of the fact that pandas column operations are faster than row-by-row operations.

url = 'http://myAPI/'

def request_function(j):
    # Parse the response body as JSON and pull out the 'c' list
    return requests.post(url, json=json.loads(j)).json()['c']

df['j'] = '{"SomeJSONData": ' + df['data'] + '}'
df['new_data'] = df['j'].apply(request_function)
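One wrinkle the snippet above leaves open: each response contributes a list in 'c', so df['new_data'] holds a list per row, while the question wants one output row per list element. On newer pandas (0.25+, which postdates this question), the list column can be expanded with explode; a sketch using the question's column names:

new_df = (
    df[['a', 'b', 'c', 'new_data']]
    .explode('new_data')                 # one row per element of each list
    .rename(columns={'new_data': 'd'})
    .reset_index(drop=True)
)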

Now, to prove that using apply in this case (string data) is indeed much faster, here's a simple test:

import time

import numpy as np
import pandas as pd

def func(text):
    return text + ' is processed'

def test_one():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()
    data['text'] = data['text'].apply(func)
    print(time.time() - start)

def test_two():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()
    for index, row in data.iterrows():
        data.loc[index, 'text'] = row['text'] + ' is processed'
    print(time.time() - start)

Results of string operations on DataFrames:

test_one (using apply): 0.023002147674560547

test_two (using iterrows): 18.912891149520874

Basically, by using the built-in pandas operations of adding the two columns and apply, you should see somewhat faster results, though your total runtime is still limited by the API response time. If the results are still too slow, you might want to consider writing an async function that saves the results to a list, and then pass that async function to .apply.
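For illustration, here is a minimal sketch of that async idea using aiohttp as a substitute for requests (aiohttp, and asyncio.run from Python 3.7+, are assumptions not in the original answer). Rather than applying an async function row by row, it gathers all the requests at once and assigns the collected results back as a column:

import asyncio
import json

import aiohttp

url = 'http://myAPI/'  # placeholder endpoint from the question

async def fetch(session, payload):
    # POST one payload and pull the 'c' list out of the JSON response
    async with session.post(url, json=payload) as response:
        return (await response.json())['c']

async def fetch_all(payloads):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, p) for p in payloads))

payloads = [json.loads(j) for j in df['j']]
df['new_data'] = asyncio.run(fetch_all(payloads))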


12 Comments

.apply is not faster than a for-loop, generally. Indeed, it is a Python for loop, under the hood.
@juanpa.arrivillaga Please don't spread misinformation. Under the hood, pandas performs many optimizations. The apply function ultimately takes advantage of a number of internal optimizations, such as using iterators in Cython. If you disagree with that, try testing it out yourself. Statistics speak louder than words.
True, OP could send the .apply function an async function that appends the results to a file on disk, or adds them to a list, and bam, lightning fast.
You are correct, I was thinking of pandas.DataFrame.apply with axis=1, which does, as far as I can tell, revert to a Python for-loop (coming in at about the same time as iterrows). I am actually very surprised at just how fast pd.Series.apply is with string operations. Digging deeper, it seems to actually be the loc-based assignment with strings that is tripping everything up. In other words, applying the function and iterating over the data-frame with itertuples gives something around 0.10 seconds. Still slower, but not 2 orders of magnitude.
Yes, but as you see, it's not actually .apply vs itertuples, it's the iloc-based assignment that slows things down. So, if you modify test 2 to append to a list, then use data['text'] = accumulator_list, the performance difference is about 0.9 vs 0.5 on my machine (vs about 20 sec for iloc assignment); see the sketch below. This enormous penalty of doing df.iloc[x, y] = z was totally a surprise for me, then again, I essentially never do that.
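For reference, the list-accumulation variant described in the last comment could look like this (a sketch modeled on the answer's benchmark, not code from the thread):

def test_three():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()
    # Accumulate into a plain list, then assign the whole column once,
    # instead of writing back through .loc on every iteration
    accumulator_list = [row.text + ' is processed' for row in data.itertuples()]
    data['text'] = accumulator_list
    print(time.time() - start)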
