My objective is to make an API call for each row in a Pandas DataFrame. Each response's JSON contains a list of strings, and I'm creating a new DataFrame with one row per element of that list. My code basically looks like this:
import json

import pandas
import requests

i = 0
new_df = pandas.DataFrame(columns=['a', 'b', 'c', 'd'])
for index, row in df.iterrows():
    url = 'http://myAPI/'
    # Build the request payload from this row's data.
    d = '{"SomeJSONData":' + row['data'] + '}'
    j = json.loads(d)
    response = requests.post(url, json=j)
    data = response.json()
    # One output row per element of the response's 'c' list.
    for new_data in data['c']:
        new_df.loc[i] = [row['a'], row['b'], row['c'], new_data]
        i += 1
This works fine, but I'm making about 5,500 API calls and writing about 6,500 rows to the new DataFrame, so it takes a while, maybe 10 minutes. I was wondering if anyone knew of a way to speed this up? I'm not too familiar with running parallel for loops in Python; could this be done while maintaining thread safety?
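For reference, here is a minimal sketch of the threaded approach using concurrent.futures from the standard library. It assumes the same url, payload shape, and df as above, and that the calls are independent of one another; max_workers=16 is an arbitrary value to tune. Because each worker only touches its own row and response, and results are combined on the main thread afterward, no explicit locking is needed.

import json
from concurrent.futures import ThreadPoolExecutor

import pandas
import requests

url = 'http://myAPI/'

def call_api(row):
    # One POST per input row; return the expanded rows for this response.
    j = json.loads('{"SomeJSONData":' + row['data'] + '}')
    data = requests.post(url, json=j).json()
    return [[row['a'], row['b'], row['c'], new_data] for new_data in data['c']]

with ThreadPoolExecutor(max_workers=16) as executor:
    # map() submits all tasks up front and yields results in input order.
    results = list(executor.map(call_api, (row for _, row in df.iterrows())))

# Flatten and build the DataFrame once, instead of growing it with
# new_df.loc[i], which is slow for thousands of rows.
rows = [r for chunk in results for r in chunk]
new_df = pandas.DataFrame(rows, columns=['a', 'b', 'c', 'd'])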
With dtype=object columns, just using iterrows is going to be about as fast as you can get. You can try threading, since requests are I/O bound. In which case, look at the examples here. Do note, that question was originally posed 7 years ago, so look at the more recent examples.

Try requests-futures outside of pandas to make async requests, get the results, and rebuild the column after that.

You won't get far with plain requests because it is synchronous by design. Check out some options here.
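To illustrate the requests-futures suggestion, here is a sketch under the same assumptions as the question (same endpoint and payload shape, df already defined); max_workers=16 is again a guess to tune. FuturesSession runs the requests on a thread pool and returns Futures, so all POSTs are in flight before any result is read.

import json

import pandas
from requests_futures.sessions import FuturesSession

url = 'http://myAPI/'
session = FuturesSession(max_workers=16)  # thread pool under the hood

# Fire off every POST without waiting; each call returns a Future.
futures = [
    (row, session.post(url, json=json.loads('{"SomeJSONData":' + row['data'] + '}')))
    for _, row in df.iterrows()
]

rows = []
for row, future in futures:
    data = future.result().json()  # blocks until this response arrives
    for new_data in data['c']:
        rows.append([row['a'], row['b'], row['c'], new_data])

# Rebuild the DataFrame in one shot once all responses are in.
new_df = pandas.DataFrame(rows, columns=['a', 'b', 'c', 'd'])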