
I am working in AWS Glue, so I cannot currently use Pandas/NumPy, etc.

I have a DataFrame of records that I need to process and write to a MySQL database. I need to check whether each record already exists and, if it does, perform an INSERT ... ON DUPLICATE KEY UPDATE. For this reason, I need to loop through the DataFrame using native Python. All the DataFrame iterators I found use pandas; is there a way to do this without pandas?

Please find herewith a sample dataframe:

df1 = sqlContext.createDataFrame([
    ('4001','81A01','Portland, ME','NY'),
    ('4002','44444','Portland, ME','NY'),
    ('4022','33333','BANGALORE','KA'),
    ('5222','88888','CHENNAI','TN')],
    ("zip_code_new", "territory_code_new", "territory_name_new", "state_new"))

I tried the following, but I got the error "AttributeError: 'DataFrame' object has no attribute 'values'":

for i in df1.values():
    print i

UPDATE: The following code seems to work with native Python to loop through the DataFrame. Psidom's code should also work, but I could not see the printed results.

arr = df1.collect()
for r in arr:
    print(r.zip_code_new)
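For the actual database step, here is a minimal sketch of how each collected row could be turned into a parameterized INSERT ... ON DUPLICATE KEY UPDATE statement. The table name `territories` and the `%s` placeholder style (as used by mysql-connector-python / MySQLdb) are assumptions for illustration:

```python
def build_upsert(table, row):
    """Build (sql, params) for one row; row is any object exposing
    the four territory fields as attributes (e.g. a Spark Row)."""
    cols = ["zip_code_new", "territory_code_new",
            "territory_name_new", "state_new"]
    placeholders = ", ".join(["%s"] * len(cols))
    # Update every non-key column from the VALUES clause on conflict.
    updates = ", ".join("{0} = VALUES({0})".format(c) for c in cols)
    sql = ("INSERT INTO {t} ({c}) VALUES ({p}) "
           "ON DUPLICATE KEY UPDATE {u}").format(
               t=table, c=", ".join(cols), p=placeholders, u=updates)
    params = tuple(getattr(row, c) for c in cols)
    return sql, params
```

Then the loop above becomes, for a cursor `cur` from whichever MySQL driver is available: `for r in arr: cur.execute(*build_upsert("territories", r))`.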

Thanks

1 Answer

You don't use a for loop on a Spark DataFrame; it has a foreach method to apply a function to each row. For example, we can print the zip_code_new in each row as follows:

def process_row(r):
    # your sql statement may go here
    print('zip_code_new: ', r.zip_code_new)

df1.foreach(process_row)

#('zip_code_new: ', u'4002')
#('zip_code_new: ', u'5222')
#('zip_code_new: ', u'4022')
#('zip_code_new: ', u'4001')
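If the goal is to write each row to MySQL rather than print it, the usual pattern is foreachPartition, which lets you open one connection per partition and batch the statements. A sketch of the per-partition handler follows; the table/column names, the `%s` placeholder style, and the FakeCursor stand-in are assumptions for illustration (on a real cluster you would open a MySQL connection inside the function, since connections cannot be serialized to the workers):

```python
def process_partition(rows, cursor, batch_size=500):
    """Buffer rows from one partition and flush them with executemany."""
    sql = ("INSERT INTO territories "
           "(zip_code_new, territory_code_new, territory_name_new, state_new) "
           "VALUES (%s, %s, %s, %s) "
           "ON DUPLICATE KEY UPDATE "
           "territory_name_new = VALUES(territory_name_new), "
           "state_new = VALUES(state_new)")
    batch = []
    for r in rows:
        batch.append((r.zip_code_new, r.territory_code_new,
                      r.territory_name_new, r.state_new))
        if len(batch) >= batch_size:
            cursor.executemany(sql, batch)
            batch = []
    if batch:  # flush the final partial batch
        cursor.executemany(sql, batch)

class FakeCursor(object):
    """Stand-in cursor so the sketch runs without a database."""
    def __init__(self):
        self.calls = []
    def executemany(self, sql, seq):
        self.calls.append((sql, list(seq)))
```

With Spark, this would be invoked as `df1.foreachPartition(lambda rows: process_partition(rows, real_cursor))`, with the connection and cursor created inside the lambda/function body.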

4 Comments

Hi, thanks. I tried something similar to the above; the execution result says FINISHED, but I am not seeing any console output, i.e., r.zip_code. The print statement is not executed. Should I import any library for this? Thanks. I am using PySpark / Zeppelin.
Hmm, I tested this in a local environment. If you have a real cluster, you won't see the print output, as it happens on the worker nodes.
I used arr = df1.collect(), followed by for r in arr:. This seems to work. Performance-wise, will this have any impact? Thanks, Psidom.
If your data frame is not large, it's a perfectly workable solution; otherwise, you might get a memory error, and it could be slow, since everything happens on the driver only.
