
I am working in AWS Glue, so I cannot currently use Pandas/NumPy, etc.

I have a DataFrame of records that I need to process and write to a MySQL database. I need to check whether each record already exists and, if it does, perform an INSERT ... ON DUPLICATE KEY UPDATE. For this reason, I need to loop through the DataFrame using native Python. All the DataFrame iterators I found use pandas; is there a way to do this without pandas?

Please find herewith a sample dataframe:

df1 = sqlContext.createDataFrame([
    ('4001','81A01','Portland, ME','NY'),
    ('4002','44444','Portland, ME','NY'),
    ('4022','33333','BANGALORE','KA'),
    ('5222','88888','CHENNAI','TN')],
    ("zip_code_new", "territory_code_new", "territory_name_new", "state_new"))

I tried the following, but I got the error "AttributeError: 'DataFrame' object has no attribute 'values'":

for i in df1.values():
    print i

UPDATE: The following code seems to work with native Python to loop through the DataFrame. Psidom's code should also work, but I could not see the printed results.

arr = df1.collect()
for r in arr:
    print(r.zip_code_new)
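For the actual database step, here is a minimal sketch of how each collected row could be turned into a parameterized INSERT ... ON DUPLICATE KEY UPDATE statement. The table name `territories` and the `%s` placeholder style (as used by mysql-connector-python / MySQLdb) are assumptions for illustration:

```python
def build_upsert(table, row):
    """Build (sql, params) for one row; row is any object exposing
    the four territory fields as attributes (e.g. a Spark Row)."""
    cols = ["zip_code_new", "territory_code_new",
            "territory_name_new", "state_new"]
    placeholders = ", ".join(["%s"] * len(cols))
    # Update every non-key column from the VALUES clause on conflict.
    updates = ", ".join("{0} = VALUES({0})".format(c) for c in cols)
    sql = ("INSERT INTO {t} ({c}) VALUES ({p}) "
           "ON DUPLICATE KEY UPDATE {u}").format(
               t=table, c=", ".join(cols), p=placeholders, u=updates)
    params = tuple(getattr(row, c) for c in cols)
    return sql, params
```

Then the loop above becomes, for a cursor `cur` from whichever MySQL driver is available: `for r in arr: cur.execute(*build_upsert("territories", r))`.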

Thanks

1 Answer

You don't use a for loop on a Spark DataFrame; it has a foreach method to apply a function to each row. For example, we can print the zip_code_new in each row as follows:

def process_row(r):
    # your sql statement may go here
    print('zip_code_new: ', r.zip_code_new)

df1.foreach(process_row)

#('zip_code_new: ', u'4002')
#('zip_code_new: ', u'5222')
#('zip_code_new: ', u'4022')
#('zip_code_new: ', u'4001')
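If the goal is to write each row to MySQL rather than print it, the usual pattern is foreachPartition, which lets you open one connection per partition and batch the statements. A sketch of the per-partition handler follows; the table/column names, the `%s` placeholder style, and the FakeCursor stand-in are assumptions for illustration (on a real cluster you would open a MySQL connection inside the function, since connections cannot be serialized to the workers):

```python
def process_partition(rows, cursor, batch_size=500):
    """Buffer rows from one partition and flush them with executemany."""
    sql = ("INSERT INTO territories "
           "(zip_code_new, territory_code_new, territory_name_new, state_new) "
           "VALUES (%s, %s, %s, %s) "
           "ON DUPLICATE KEY UPDATE "
           "territory_name_new = VALUES(territory_name_new), "
           "state_new = VALUES(state_new)")
    batch = []
    for r in rows:
        batch.append((r.zip_code_new, r.territory_code_new,
                      r.territory_name_new, r.state_new))
        if len(batch) >= batch_size:
            cursor.executemany(sql, batch)
            batch = []
    if batch:  # flush the final partial batch
        cursor.executemany(sql, batch)

class FakeCursor(object):
    """Stand-in cursor so the sketch runs without a database."""
    def __init__(self):
        self.calls = []
    def executemany(self, sql, seq):
        self.calls.append((sql, list(seq)))
```

With Spark, this would be invoked as `df1.foreachPartition(lambda rows: process_partition(rows, real_cursor))`, with the connection and cursor created inside the lambda/function body.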

4 Comments

Hi, thanks. I tried something similar to the above; the execution result says FINISHED, but I am not seeing any console output, i.e., r.zip_code. The print statement is not executed. Should I import any library for this? Thanks. I am using PySpark / Zeppelin.
Hmm, I tested this in a local environment. If you have a real cluster, you won't see the print output, as it happens on the worker nodes.
I used arr = df1.collect(), followed by for r in arr:. This seems to work. Performance-wise, will this have any impact? Thanks, Psidom.
If your data frame is not large, it's a perfectly workable solution; otherwise, you might get a memory error, and it could be slow, since everything happens on the driver only.
